Decoder - Smartmontools

Hello,

I assume my question falls outside the scope of Debian support, so I'm asking it here.
I suspect one of my disks is close to dying.

I ran several tests with smartmontools: the short ones all completed without showing anything serious, but the long ones were all interrupted and I don't understand why.

Here is probably the most useful output for understanding the situation:

[code]# smartctl -a /dev/sda
smartctl 5.41 2011-06-09 r3365 [i686-linux-3.2.0-4-686-pae] (local build)
Copyright © 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model: SAMSUNG HD103UI
Serial Number: S1LMJ9DS400339
LU WWN Device Id: 5 0024e9 2006eda7f
Firmware Version: 1AA01113
User Capacity: 1 000 204 886 016 bytes [1,00 TB]
Sector Size: 512 bytes logical/physical
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 3b
Local Time is: Sun Sep 1 11:47:36 2013 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 35) The self-test routine was interrupted
by the host with a hard or soft reset.
Total time to complete Offline
data collection: (15559) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 255) minutes.
Conveyance self-test routine
recommended polling time: ( 27) minutes.
SCT capabilities: (0x003f) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 100 100 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0007 074 074 011 Pre-fail Always - 8750
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 48
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 100 100 051 Pre-fail Always - 0
8 Seek_Time_Performance 0x0025 100 100 015 Pre-fail Offline - 10133
9 Power_On_Hours 0x0032 093 093 000 Old_age Always - 35954
10 Spin_Retry_Count 0x0033 100 100 051 Pre-fail Always - 0
11 Calibration_Retry_Count 0x0012 100 100 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 47
13 Read_Soft_Error_Rate 0x000e 100 100 000 Old_age Always - 0
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0033 001 001 000 Pre-fail Always - 150
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 072 066 000 Old_age Always - 28 (Min/Max 26/30)
194 Temperature_Celsius 0x0022 072 064 000 Old_age Always - 28 (Min/Max 26/31)
195 Hardware_ECC_Recovered 0x001a 100 100 000 Old_age Always - 4323090
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 100 100 000 Old_age Always - 26563
200 Multi_Zone_Error_Rate 0x000a 100 100 000 Old_age Always - 2
201 Soft_Read_Error_Rate 0x000a 100 100 000 Old_age Always - 0

SMART Error Log Version: 1
ATA Error Count: 594 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It “wraps” after 49.710 days.

Error 594 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH


84 51 1f b8 91 99 e0 Error: ICRC, ABRT 31 sectors at LBA = 0x009991b8 = 10064312

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name


c8 00 00 d7 90 99 e0 08 3d+18:32:03.930 READ DMA
c8 00 08 97 82 9b e0 08 3d+18:32:03.920 READ DMA
c8 00 08 b7 66 9b e0 08 3d+18:32:03.920 READ DMA
c8 00 10 97 66 9b e0 08 3d+18:32:03.910 READ DMA
c8 00 10 8f c8 95 e0 08 3d+18:32:03.880 READ DMA

Error 593 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH


84 51 3f 38 53 ad e1 Error: ICRC, ABRT 63 sectors at LBA = 0x01ad5338 = 28136248

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name


c8 00 00 77 52 ad e1 08 3d+18:31:56.620 READ DMA
c8 00 00 77 65 6d e1 08 3d+18:31:56.600 READ DMA
c8 00 00 57 5b 6d e1 08 3d+18:31:56.600 READ DMA
c8 00 00 e7 50 ad e1 08 3d+18:31:56.600 READ DMA
c8 00 40 f7 59 6d e1 08 3d+18:31:56.600 READ DMA

Error 592 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH


84 51 1f 28 df 73 e0 Error: ICRC, ABRT 31 sectors at LBA = 0x0073df28 = 7593768

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name


c8 00 30 17 df 73 e0 08 3d+18:31:55.550 READ DMA
c8 00 08 a7 13 5b e0 08 3d+18:31:55.540 READ DMA
c8 00 08 07 47 5c e0 08 3d+18:31:55.540 READ DMA
c8 00 08 ff 46 5c e0 08 3d+18:31:55.530 READ DMA
c8 00 08 ff c1 99 e0 08 3d+18:31:55.520 READ DMA

Error 591 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH


84 51 1f 38 b7 73 e0 Error: ICRC, ABRT 31 sectors at LBA = 0x0073b738 = 7583544

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name


c8 00 00 57 b6 73 e0 08 3d+18:31:43.690 READ DMA
c8 00 08 2f 02 94 e0 08 3d+18:31:43.690 READ DMA
c8 00 08 07 02 94 e0 08 3d+18:31:43.690 READ DMA
c8 00 08 37 02 94 e0 08 3d+18:31:43.690 READ DMA
c8 00 08 37 0d 97 e0 08 3d+18:31:43.680 READ DMA

Error 590 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH


84 51 2f 70 69 73 e0 Error: ICRC, ABRT 47 sectors at LBA = 0x00736970 = 7563632

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name


c8 00 00 9f 68 73 e0 08 3d+18:31:37.670 READ DMA
c8 00 00 9f 67 73 e0 08 3d+18:31:37.550 READ DMA
c8 00 08 f0 29 5c e7 08 3d+18:31:37.530 READ DMA
c8 00 08 d0 29 5c e7 08 3d+18:31:37.530 READ DMA
c8 00 08 b0 29 5c e7 08 3d+18:31:37.530 READ DMA

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error

1 Extended offline Interrupted (host reset) 30% 35945 -

2 Short offline Completed without error 00% 35943 -

3 Extended offline Interrupted (host reset) 30% 35893 -

4 Short offline Completed without error 00% 35891 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
[/code]

At the end you can clearly see the four tests that were run, with both long tests interrupted while 30% remained.
There are also a number of errors whose exact nature I don't understand.

What do you think?
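For anyone who wants to read the self-test log at the end of that output programmatically, here is a minimal Python sketch. The sample lines are copied from the log above; the regular expression `LINE_RE` and the function name `parse_selftest_log` are my own assumptions, fitted to this particular layout, and are not part of smartmontools.

```python
import re

# Lines copied from the self-test log in the smartctl output above.
SELFTEST_LOG = """\
1 Extended offline Interrupted (host reset) 30% 35945 -
2 Short offline Completed without error 00% 35943 -
3 Extended offline Interrupted (host reset) 30% 35893 -
4 Short offline Completed without error 00% 35891 -
"""

# Hypothetical pattern for this layout only; other drives or smartctl
# versions may print different test descriptions or statuses.
LINE_RE = re.compile(
    r"^\s*#?\s*(?P<num>\d+)\s+"
    r"(?P<test>(?:Short|Extended|Conveyance|Selective) offline)\s+"
    r"(?P<status>.+?)\s+"
    r"(?P<remaining>\d+)%\s+"
    r"(?P<hours>\d+)\s+"
    r"(?P<lba>\S+)\s*$"
)


def parse_selftest_log(text):
    """Return one dict per recognised self-test log line."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip headers, blank lines, unknown layouts
        entries.append({
            "num": int(match.group("num")),
            "test": match.group("test"),
            "status": match.group("status"),
            "remaining_pct": int(match.group("remaining")),
            "lifetime_hours": int(match.group("hours")),
            "first_error_lba": match.group("lba"),
        })
    return entries


entries = parse_selftest_log(SELFTEST_LOG)
interrupted = [e["num"] for e in entries if e["status"].startswith("Interrupted")]
for e in entries:
    print(e["num"], e["test"], "->", e["status"], f"({e['remaining_pct']}% left)")
```

Note that the log reports the long tests as interrupted by a host reset, not as failed with a read error, and the column for the LBA of the first error is empty in all four entries.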

Your huge Hardware_ECC_Recovered count, together with the ATA/DMA errors shown, points rather to an interface problem: check the cables and the connections, and possibly switch to another SATA port.

Note: I may be wrong about how to interpret the Hardware_ECC_Recovered figure, but given the type of errors encountered, I'm almost certain it's a cable or interface problem.
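The register dump in each error entry can be decoded by hand, which shows where the "ICRC, ABRT" label in the output comes from. Below is a small Python sketch, assuming 28-bit LBA addressing as used by the READ DMA command (c8) in this log. The helper names are hypothetical, and the bit table lists only the error-register bits I am sure of, so it is not a complete decoder.

```python
# Error-register (ER) bits, limited to the ones I am confident about.
ERROR_BITS = {
    0x80: "ICRC",  # interface CRC error during a DMA transfer
    0x40: "UNC",   # uncorrectable data error
    0x10: "IDNF",  # requested address not found
    0x04: "ABRT",  # command aborted
}


def decode_error_register(er):
    """Return the names of the known bits set in ER, highest bit first."""
    return [name for bit, name in sorted(ERROR_BITS.items(), reverse=True)
            if er & bit]


def decode_lba28(sn, cl, ch, dh):
    """Rebuild a 28-bit LBA from SN, CL, CH and the low nibble of DH."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


def decode_register_line(line):
    """Decode an 'ER ST SC SN CL CH DH' line as printed in the error log."""
    er, st, sc, sn, cl, ch, dh = (int(tok, 16) for tok in line.split()[:7])
    return {
        "flags": decode_error_register(er),
        "sector_count": sc,
        "lba": decode_lba28(sn, cl, ch, dh),
    }


# Register values of errors 594 and 593 from the output above.
err_594 = decode_register_line("84 51 1f b8 91 99 e0")
err_593 = decode_register_line("84 51 3f 38 53 ad e1")
print(err_594)
print(err_593)
```

For error 594 this gives the flags ICRC and ABRT, 31 sectors, and LBA 10064312, the same values smartctl printed. ICRC is set when the data was corrupted on the link between the controller and the disk, which fits the cable or interface hypothesis better than a failing platter would.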

[quote=“fran.b”]Your huge Hardware_ECC_Recovered count, together with the ATA/DMA errors shown, points rather to an interface problem: check the cables and the connections, and possibly switch to another SATA port.

Note: I may be wrong about how to interpret the Hardware_ECC_Recovered figure, but given the type of errors encountered, I'm almost certain it's a cable or interface problem.[/quote]

Thanks for the reply. If that is indeed the case, it would be reassuring.
Following your remark, I searched the web a bit, and it seems this value is "normal" on large disks. Still, I'm going to change the cable and move the disk to another interface, then rerun the test and hope it manages to finish.

I've also had a few problems with SATA interfaces. One cable or another would work itself loose from time to time.
I systematically replaced all the internal SATA cables with locking cables, at €0.50 apiece!

Since then... no more problems with my hard disks, CD/DVD burner, card reader, ... SATA devices :wink:

Done: I changed the disk's cable and plugged it into another SATA port.
Naturally, a network card took the opportunity to die on me, but that's another story.

This time the long test was able to finish, apparently without error:

[code]# smartctl -a /dev/sda
smartctl 5.41 2011-06-09 r3365 [i686-linux-3.2.0-4-686-pae] (local build)
Copyright © 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model: SAMSUNG HD103UI
Serial Number: S1LMJ9DS400339
LU WWN Device Id: 5 0024e9 2006eda7f
Firmware Version: 1AA01113
User Capacity: 1 000 204 886 016 bytes [1,00 TB]
Sector Size: 512 bytes logical/physical
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 3b
Local Time is: Wed Sep 4 20:22:09 2013 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (15559) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 255) minutes.
Conveyance self-test routine
recommended polling time: ( 27) minutes.
SCT capabilities: (0x003f) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 100 100 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0007 073 073 011 Pre-fail Always - 8770
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 51
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 100 100 051 Pre-fail Always - 0
8 Seek_Time_Performance 0x0025 100 100 015 Pre-fail Offline - 10214
9 Power_On_Hours 0x0032 093 093 000 Old_age Always - 36034
10 Spin_Retry_Count 0x0033 100 100 051 Pre-fail Always - 0
11 Calibration_Retry_Count 0x0012 100 100 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 50
13 Read_Soft_Error_Rate 0x000e 100 100 000 Old_age Always - 0
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0033 001 001 000 Pre-fail Always - 150
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 071 066 000 Old_age Always - 29 (Min/Max 28/32)
194 Temperature_Celsius 0x0022 071 064 000 Old_age Always - 29 (Min/Max 28/33)
195 Hardware_ECC_Recovered 0x001a 100 100 000 Old_age Always - 457164
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 100 100 000 Old_age Always - 30256
200 Multi_Zone_Error_Rate 0x000a 100 100 000 Old_age Always - 2
201 Soft_Read_Error_Rate 0x000a 100 100 000 Old_age Always - 0

SMART Error Log Version: 1
ATA Error Count: 643 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It “wraps” after 49.710 days.

Error 643 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH


84 51 00 76 02 7c e1 Error: ICRC, ABRT at LBA = 0x017c0276 = 24904310

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name


c8 00 08 6f 02 7c e1 08 6d+18:35:54.430 READ DMA
c8 00 08 47 49 7d e1 08 6d+18:35:54.430 READ DMA
c8 00 08 97 02 7c e1 08 6d+18:35:54.430 READ DMA
c8 00 08 27 4b 7d e1 08 6d+18:35:54.420 READ DMA
c8 00 08 c7 45 7d e1 08 6d+18:35:54.410 READ DMA

Error 642 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH


84 51 4f 30 3e 71 e0 Error: ICRC, ABRT 79 sectors at LBA = 0x00713e30 = 7421488

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name


c8 00 00 7f 3d 71 e0 08 6d+18:35:44.210 READ DMA
c8 00 00 7f 3c 71 e0 08 6d+18:35:44.110 READ DMA
c8 00 00 7f 3b 71 e0 08 6d+18:35:43.960 READ DMA
c8 00 00 7f 3a 71 e0 08 6d+18:35:43.670 READ DMA
c8 00 00 7f 39 71 e0 08 6d+18:35:43.480 READ DMA

Error 641 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH


84 51 00 2e 33 5b e0 Error: ICRC, ABRT at LBA = 0x005b332e = 5976878

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name


c8 00 08 27 33 5b e0 08 6d+18:35:10.040 READ DMA
c8 00 08 8f 32 5b e0 08 6d+18:35:10.040 READ DMA
c8 00 18 47 32 5b e0 08 6d+18:35:10.040 READ DMA
c8 00 08 2f 32 5b e0 08 6d+18:35:10.040 READ DMA
c8 00 08 d7 84 5a e0 08 6d+18:35:10.040 READ DMA

Error 640 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH


84 51 bf 48 fa 5b e0 Error: ICRC, ABRT 191 sectors at LBA = 0x005bfa48 = 6027848

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name


c8 00 00 07 fa 5b e0 08 6d+18:34:48.130 READ DMA
c8 00 00 07 f9 5b e0 08 6d+18:34:48.130 READ DMA
c8 00 d0 37 f8 5b e0 08 6d+18:34:48.130 READ DMA
c8 00 20 0f f8 5b e0 08 6d+18:34:48.130 READ DMA
c8 00 88 7f f7 5b e0 08 6d+18:34:48.120 READ DMA

Error 639 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH


84 51 0f c8 e6 7f e0 Error: ICRC, ABRT 15 sectors at LBA = 0x007fe6c8 = 8382152

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name


c8 00 00 d7 e5 7f e0 08 6d+18:34:46.800 READ DMA
c8 00 00 d7 e4 7f e0 08 6d+18:34:46.800 READ DMA
c8 00 00 d7 e3 7f e0 08 6d+18:34:46.800 READ DMA
c8 00 00 d7 e2 7f e0 08 6d+18:34:46.800 READ DMA
c8 00 00 d7 e1 7f e0 08 6d+18:34:46.790 READ DMA

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error

1 Extended offline Completed without error 00% 36030 -

2 Extended offline Interrupted (host reset) 30% 35945 -

3 Short offline Completed without error 00% 35943 -

4 Extended offline Interrupted (host reset) 30% 35893 -

5 Short offline Completed without error 00% 35891 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
[/code]

However, there are still just as many ECC errors, as well as the errors pointed out earlier of the "disk power-on lifetime" type. According to a few (seemingly fairly serious) sources on the web, I'm inclined to believe it isn't serious, especially because of this:

=== START OF READ SMART DATA SECTION === SMART overall-health self-assessment test result: PASSED

There are still about 10 times fewer of them (4323090 before, 457164 now)... I did tell you I wasn't very sure of my interpretation of these ECC errors.
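To compare the two outputs without eyeballing the tables, here is a rough Python sketch that extracts the raw values and prints the change. The attribute lines are copied from the two smartctl outputs in this thread; the function name `parse_raw_values` and the field positions are my own assumptions about this layout.

```python
# Attribute lines copied from the first (Sep 1) and second (Sep 4) outputs.
BEFORE = """\
9 Power_On_Hours 0x0032 093 093 000 Old_age Always - 35954
195 Hardware_ECC_Recovered 0x001a 100 100 000 Old_age Always - 4323090
199 UDMA_CRC_Error_Count 0x003e 100 100 000 Old_age Always - 26563
"""

AFTER = """\
9 Power_On_Hours 0x0032 093 093 000 Old_age Always - 36034
195 Hardware_ECC_Recovered 0x001a 100 100 000 Old_age Always - 457164
199 UDMA_CRC_Error_Count 0x003e 100 100 000 Old_age Always - 30256
"""


def parse_raw_values(text):
    """Map attribute name -> raw value, assuming the 10-column layout above."""
    values = {}
    for line in text.splitlines():
        fields = line.split()
        # Expect: ID, name, flag, value, worst, thresh, type, updated,
        # when_failed, raw value (first token only).
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        values[fields[1]] = int(fields[9])
    return values


before = parse_raw_values(BEFORE)
after = parse_raw_values(AFTER)
deltas = {name: after[name] - before[name] for name in before}
ecc_ratio = before["Hardware_ECC_Recovered"] / after["Hardware_ECC_Recovered"]

for name, delta in deltas.items():
    print(f"{name}: {before[name]} -> {after[name]} ({delta:+d})")
print(f"Hardware_ECC_Recovered ratio: {ecc_ratio:.1f}")
```

The ECC figure dropped by a factor of roughly 9.5, which suggests it is not a simple cumulative counter on this drive. UDMA_CRC_Error_Count, on the other hand, rose by 3693 over 80 power-on hours. These two snapshots alone do not show whether that increase happened before or after the cable was changed, so it may be worth reading the value again after some further use.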

Oh, that's no big deal; I'm already glad to have a lead to investigate and a specific point to look into.

All the less so since:

So it is very difficult to interpret this value.
The disk manufacturer would therefore be the only one able to provide a "reliable" interpretation of it.