Some disk problems encountered in operations

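I got to the office this morning, opened my mailbox, and was greeted by a pile of dmesg alert emails, so I logged on to the server right away to see what had gone wrong.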

2014-07-18

1. Check the dmesg output, which looks like this:
dmesg | more
EXT3-fs error (device sdg1): ext3_dx_find_entry: bad entry in directory #2: rec_len is smaller than minimal - offset=20480, inode=2553887680, rec_len=0, name_len=0
EXT3-fs error (device sdg1): ext3_add_entry: bad entry in directory #2: rec_len is smaller than minimal - offset=0, inode=2553887680, rec_len=0, name_len=0
EXT3-fs error (device sdg1): ext3_dx_find_entry: bad entry in directory #2: rec_len is smaller than minimal - offset=24576, inode=2553887680, rec_len=0, name_len=0


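Notes: when a server misbehaves, there are generally two angles to consider: hardware and software. Judging from the errors above, this looks like a filesystem problem. The messages say EXT3-fs error and report bad directory entries (rec_len=0) together with inode numbers; directory #2 is normally the root directory of an ext3 filesystem.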
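
When several devices log errors at the same time, it can help to tally the EXT3-fs error lines per device before digging in. The following is a minimal sketch: count_ext3_errors is a hypothetical helper name, and it assumes the kernel messages keep the "EXT3-fs error (device X)" prefix shown above.

```shell
#!/bin/sh
# count_ext3_errors (hypothetical helper): read kernel log text on stdin
# and print "<device> <count>" for every device named in an EXT3-fs error.
count_ext3_errors() {
    # Keep only the device name from matching lines, then count duplicates.
    sed -n 's/.*EXT3-fs error (device \([^)]*\)).*/\1/p' \
        | sort \
        | uniq -c \
        | awk '{ print $2, $1 }'
}

# Usage on a live system:
#   dmesg | count_ext3_errors
```

A device that shows up here with a large count is the first candidate for the checks that follow.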

[root@ctc ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1             1.9G  1.5G  303M  84% /
/dev/sda7             238G   12G  215G   6% /data0
/dev/sdb1             271G   32G  225G  13% /data1
/dev/sdd1             271G  152G  106G  60% /data3
/dev/sde1             271G  172G   85G  68% /data4
/dev/sdg1             271G  153G  104G  60% /data6
/dev/sdi1             271G  170G   88G  67% /data8
/dev/sdj1             271G  192M  257G   1% /data9
/dev/sda6             3.8G   87M  3.5G   3% /tmp
/dev/sda5             9.5G  2.6G  6.4G  29% /usr
/dev/sda3             9.5G  732M  8.3G   8% /var
tmpfs                 5.9G     0  5.9G   0% /dev/shm
/dev/sdc1             275G  171G   91G  66% /data2
/dev/sdf1             275G  172G   89G  66% /data5
/dev/sdh1             275G  171G   91G  66% /data7

[root@ctc ~]# touch /data6/a.txt
touch: cannot touch `/data6/a.txt': Input/output error
[root@ctc ~]#

As you can see, new files cannot be created; touch fails with Input/output error.
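
The manual touch test can be repeated across every data mount to find out which ones have stopped accepting writes. This is only a sketch: probe_write and the probe file name are hypothetical, and the mount points in the usage comment are the ones from the df output above.

```shell
#!/bin/sh
# probe_write DIR (hypothetical helper): try to create and remove a small
# probe file in DIR. Prints "OK DIR" or "FAIL DIR"; returns 1 on failure.
probe_write() {
    dir=$1
    probe="$dir/.write_probe.$$"
    if touch "$probe" 2>/dev/null; then
        rm -f "$probe"
        echo "OK $dir"
    else
        echo "FAIL $dir"
        return 1
    fi
}

# Usage:
#   for d in /data0 /data1 /data2 /data3 /data4 /data5 /data6 /data7 /data8 /data9; do
#       probe_write "$d"
#   done
```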
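2. Check the hardware. Our servers are all Dell, so I used MegaCli64 directly to check the disk state. sdg corresponds to the drive in slot 6, and its information is as follows: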

Enclosure Device ID: 32
Slot Number: 6
Device Id: 6
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 286102MB [0x22ecb25c Sectors]
Non Coerced Size: 285590MB [0x22dcb25c Sectors]
Coerced Size: 285568MB [0x22dc0000 Sectors]
Firmware state: Online
SAS Address(0): 0x5000cca018d30e91
SAS Address(1): 0x0
Connected Port Number: 0(path0)
Inquiry Data: HITACHI HUS156030VLS600 E516JXYS28ZN
Foreign State: None
Media Type: Hard Disk Device
Device Speed: Unknown
Link Speed: Unknown

First, the disk is in the Online state:
Firmware state: Online


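Next, the error counters suggest the disk hardware is fine. Media Error points to sector problems such as bad blocks; Other Error suggests the disk may be loosely seated.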
Media Error Count: 0
Other Error Count: 0
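
Reading these three fields drive by drive gets tedious on a machine with ten disks. The sketch below condenses a physical-drive listing into one line per slot. pd_health is a hypothetical helper name, and it assumes the field names and ordering seen in the listing above (Slot Number, the two error counters, then Firmware state).

```shell
#!/bin/sh
# pd_health (hypothetical helper): read MegaCli physical-drive listing text
# on stdin and print one summary line per drive:
#   slot=<n> media=<n> other=<n> state=<text>
pd_health() {
    awk -F': *' '
        /^Slot Number/       { slot  = $2 }
        /^Media Error Count/ { media = $2 }
        /^Other Error Count/ { other = $2 }
        # Firmware state comes after the counters, so it closes the record.
        /^Firmware state/    { print "slot=" slot, "media=" media, "other=" other, "state=" $2 }
    '
}

# Usage:
#   /opt/MegaRAID/MegaCli/MegaCli64 -pdlist -aAll | pd_health
```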

Finally, if you are still not convinced, you can scan the disk for bad blocks with badblocks. Here it found none.
[root@ctc ~]# /sbin/badblocks -o sdg.txt -vs /dev/sdg
Checking blocks 0 to 292421632
Checking for bad blocks (read-only test): done
Pass completed, 0 bad blocks found.


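3. Having roughly ruled out hardware, the problem is on the software side, which usually means the filesystem. Repairing it with fsck normally resolves it.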

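a. Our production machines run a caching service with low data-integrity requirements, so I usually stop the service and then force-unmount the disk directly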
umount -fl /data6/

b. After unmounting, other programs may still be using the disk
[root@ctc ~]# /sbin/fsck.ext3 -y /dev/sdg1
e2fsck 1.39 (29-May-2006)
/sbin/fsck.ext3: Device or resource busy while trying to open /dev/sdg1
Filesystem mounted or opened exclusively by another program?

c. Never mind that, just kill the processes that are still using the disk
[root@ctc ~]# /sbin/fuser -km /dev/sdg1
/dev/sdg1:           11683 11689

d. At last, fsck can repair the filesystem automatically
[root@ctc ~]# /sbin/fsck.ext3 -y /dev/sdg1
e2fsck 1.39 (29-May-2006)
/data6 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
(the long run of repair output in between is omitted...)
/data6: ***** FILE SYSTEM WAS MODIFIED *****
/data6: 17786/73105408 files (10.8% non-contiguous), 42233686/73103774 blocks


e. Once the repair is done, mount the disk and put it back into use
[root@ctc ~]# mount -a
[root@ctc ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1             1.9G  1.5G  303M  84% /
/dev/sda7             238G   12G  215G   6% /data0
/dev/sdb1             271G   34G  224G  13% /data1
/dev/sdd1             271G  152G  105G  60% /data3
/dev/sde1             271G  173G   85G  68% /data4
/dev/sdi1             271G  170G   87G  67% /data8
/dev/sdj1             271G  192M  257G   1% /data9
/dev/sda6             3.8G   87M  3.5G   3% /tmp
/dev/sda5             9.5G  2.6G  6.4G  29% /usr
/dev/sda3             9.5G  732M  8.3G   8% /var
tmpfs                 5.9G     0  5.9G   0% /dev/shm
/dev/sdc1             275G  171G   90G  66% /data2
/dev/sdf1             275G  173G   89G  67% /data5
/dev/sdh1             275G  171G   90G  66% /data7
/dev/sdg1             271G  153G  104G  60% /data6
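
Steps a through e can be collected into a small script. The sketch below only prints the commands unless DRY_RUN is set to 0. The run and repair_ext3 names and the DRY_RUN variable are hypothetical; the commands themselves are the ones used above. Force-unmounting and killing processes is acceptable here only because the data is disposable cache.

```shell
#!/bin/sh
# run (hypothetical helper): print the command when DRY_RUN is 1 (the
# default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# repair_ext3 DEVICE MOUNTPOINT (hypothetical helper): same order as the
# manual steps: unmount, kill remaining users, fsck, remount from fstab.
repair_ext3() {
    dev=$1
    mnt=$2
    run umount -fl "$mnt"
    run /sbin/fuser -km "$dev"
    run /sbin/fsck.ext3 -y "$dev"
    run mount -a
}

# Usage (dry run first, then for real):
#   repair_ext3 /dev/sdg1 /data6
#   DRY_RUN=0; repair_ext3 /dev/sdg1 /data6
```

Running it in dry-run mode first makes it easy to confirm the device and mount point before anything is killed or unmounted.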

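PS: Right after I finished with sdg, sdd broke as well. This time I skipped fsck. A filesystem problem can also be solved by simply reformatting and remounting (this wipes the partition, which is acceptable for our cache data), as follows: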
[root@ctc ~]# dmesg
EXT3-fs error (device sdd1): ext3_add_entry: bad entry in directory #2: rec_len is smaller than minimal - offset=0, inode=2553887680, rec_len=0, name_len=0
EXT3-fs error (device sdd1): ext3_dx_find_entry: bad entry in directory #2: rec_len is smaller than minimal - offset=28672, inode=3755, rec_len=0, name_len=0
EXT3-fs error (device sdd1): ext3_add_entry: bad entry in directory #2: rec_len is smaller than minimal - offset=0, inode=3755, rec_len=0, name_len=0
EXT3-fs error (device sdd1): ext3_dx_find_entry: bad entry in directory #2: rec_len is smaller than minimal - offset=20480, inode=16832, rec_len=8, name_len=0
EXT3-fs error (device sdd1): ext3_add_entry: bad entry in directory #2: rec_len is smaller than minimal - offset=0, inode=16832, rec_len=8, name_len=0
[root@ctc ~]# umount -fl /data3
[root@ctc ~]# /sbin/fuser -km /dev/sdd1
/dev/sdd1:           11684 11688
[root@ctc ~]# /sbin/mkfs.ext3 /dev/sdd1
[root@ctc ~]# /sbin/e2label /dev/sdd1 /data3
[root@ctc ~]# mount -a
[root@ctc ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1             1.9G  1.8G   16M 100% /
/dev/sda7             238G   12G  215G   6% /data0
/dev/sdb1             271G   34G  223G  14% /data1
/dev/sde1             271G  173G   85G  68% /data4
/dev/sdi1             271G  170G   87G  67% /data8
/dev/sdj1             271G  192M  257G   1% /data9
/dev/sda6             3.8G   87M  3.5G   3% /tmp
/dev/sda5             9.5G  2.6G  6.4G  29% /usr
/dev/sda3             9.5G  732M  8.3G   8% /var
tmpfs                 5.9G     0  5.9G   0% /dev/shm
/dev/sdc1             275G  171G   90G  66% /data2
/dev/sdf1             275G  173G   89G  67% /data5
/dev/sdh1             275G  171G   90G  66% /data7
/dev/sdg1             271G  153G  104G  60% /data6
/dev/sdd1             275G  192M  261G   1% /data3

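End: I keep running into disk problems lately, so I am writing down today's fix and the reasoning behind it. Consider this a placeholder; I will fill in more as further disk problems come up.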

2014-07-24

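Below is the procedure for replacing a failed disk in a server:
a. Before the replacement, Media Error Count: 2 shows that the disk has already developed bad sectors: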
Enclosure Device ID: 32
Slot Number: 2
Drive's postion: DiskGroup: 2, Span: 0, Arm: 0
Enclosure position: 0
Device Id: 2
WWN: 5000C50047A6F81C
Sequence Number: 2
Media Error Count: 2
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 279.396 GB [0x22ecb25c Sectors]
Non Coerced Size: 278.896 GB [0x22dcb25c Sectors]
Coerced Size: 278.875 GB [0x22dc0000 Sectors]
Firmware state: Online, Spun Up
Device Firmware Level: ES65
Shield Counter: 0
Successful diagnostics completion on : N/A
SAS Address(0): 0x5000c50047a6f81d
SAS Address(1): 0x0
Connected Port Number: 0(path0)
Inquiry Data: SEAGATE ST3300657SS ES656SJ4BLFX
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Hard Disk Device
Drive Temperature :47C (116.60 F)
PI Eligibility: No
Drive is formatted for PI information: No
PI: No PI
Drive's write cache : Disabled
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Port-1 :
Port status: Active
Port's Linkspeed: Unknown
Drive has flagged a S.M.A.R.T alert : No




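b. After the replacement the new disk is healthy, but Firmware state: Unconfigured(good), Spun Up means it is not online yet. At this point fdisk -l on the system does not show the disk, so it cannot be used.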
Enclosure Device ID: 32
Slot Number: 2
Enclosure position: 0
Device Id: 2
WWN: 5000C50075EBB384
Sequence Number: 7
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 279.396 GB [0x22ecb25c Sectors]
Non Coerced Size: 278.896 GB [0x22dcb25c Sectors]
Coerced Size: 278.875 GB [0x22dc0000 Sectors]
Firmware state: Unconfigured(good), Spun Up
Device Firmware Level: ES66
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x5000c50075ebb385
SAS Address(1): 0x0
Connected Port Number: 0(path0)
Inquiry Data: SEAGATE ST3300657SS     ES666SJ84X55
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Hard Disk Device
Drive Temperature :45C (113.00 F)
PI Eligibility:  No
Drive is formatted for PI information:  No
PI: No PI
Drive's write cache : Disabled
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Port-1 :
Port status: Active
Port's Linkspeed: Unknown
Drive has flagged a S.M.A.R.T alert : No

c. Create a RAID 0 on the newly added disk; only then can the system use it normally
/opt/MegaRAID/MegaCli/MegaCli64 -CfgLdAdd -r0 [32:2] WB Direct -a0
Enclosure Device ID: 32
Slot Number: 2


Now the Firmware state is Online, which means the disk is ready for use
Enclosure Device ID: 32
Slot Number: 2
Drive's postion: DiskGroup: 9, Span: 0, Arm: 0
Enclosure position: 0
Device Id: 2
WWN: 5000C50075EBB384
Sequence Number: 8
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 279.396 GB [0x22ecb25c Sectors]
Non Coerced Size: 278.896 GB [0x22dcb25c Sectors]
Coerced Size: 278.875 GB [0x22dc0000 Sectors]
Firmware state: Online, Spun Up

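Show information for all logical drives. Below is the logical drive created by the RAID 0 on the disk that was just replaced; it shows the read/write cache policies and so on (these can be changed and tuned if needed):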
/opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -LALL -aAll

Virtual Drive: 2 (Target Id: 2)
Name                :
RAID Level          : Primary-0, Secondary-0, RAID Level Qualifier-0
Size                : 278.875 GB
Parity Size         : 0
State               : Optimal
Strip Size          : 64 KB
Number Of Drives    : 1
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy   : Disk's Default
Encryption Type     : None
Bad Blocks Exist: No
Is VD Cached: Yes
Cache Cade Type : Read Only
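
One thing worth checking in this listing is whether the current cache policy still matches the default one, since the "No Write Cache if Bad BBU" setting implies the controller may change the write policy when the battery is bad. A minimal sketch, assuming the field names shown above; ld_policy_drift is a hypothetical helper name.

```shell
#!/bin/sh
# ld_policy_drift (hypothetical helper): read logical-drive listing text on
# stdin and report every virtual drive whose current cache policy differs
# from its default cache policy. Prints nothing when they all match.
ld_policy_drift() {
    awk -F': *' '
        /^Virtual Drive/        { vd = $2; sub(/ .*/, "", vd) }
        /^Default Cache Policy/ { def = $2 }
        /^Current Cache Policy/ {
            if ($2 != def)
                print "VD " vd ": default=[" def "] current=[" $2 "]"
        }
    '
}

# Usage:
#   /opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -LALL -aAll | ld_policy_drift
```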


At this point fdisk -l shows the disk
/sbin/fdisk -l

Disk /dev/sdc: 299.4 GB, 299439751168 bytes
255 heads, 63 sectors/track, 36404 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000
Disk /dev/sdc doesn't contain a valid partition table
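
As a sanity check that this is the drive that was just configured, the coerced size MegaCli reports in sectors (0x22dc0000) can be converted to bytes and compared with the fdisk figure. The sketch assumes 512-byte sectors, as fdisk reports above; sectors_to_bytes is a hypothetical helper name.

```shell
#!/bin/sh
# sectors_to_bytes SECTORS (hypothetical helper): convert a sector count
# (decimal or 0x-prefixed hex) to bytes, assuming 512-byte sectors.
sectors_to_bytes() {
    echo $(( $1 * 512 ))
}

sectors_to_bytes 0x22dc0000   # prints 299439751168
```

The result, 299439751168 bytes, is exactly the size fdisk prints for /dev/sdc.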

d. Finally, partition, format, and mount:
/sbin/fdisk /dev/sdc
/sbin/mkfs.ext4 /dev/sdc1
/bin/mount /dev/sdc1 /data9

2014-10-27

After replacing a disk, I found that its Firmware state was JBOD. In that case, first disable JBOD, then create the RAID 0 online, and format and mount as before
[root@54 peiqiang]# /opt/MegaRAID/MegaCli/MegaCli64 -pdlist -aAll
Enclosure Device ID: 32
Slot Number: 3
Enclosure position: 0
Device Id: 3
WWN: 5000C500715CEAF4
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 279.396 GB [0x22ecb25c Sectors]
Non Coerced Size: 278.896 GB [0x22dcb25c Sectors]
Coerced Size: 278.875 GB [0x22dc0000 Sectors]
Firmware state: JBOD

[root@54 peiqiang]# /opt/MegaRAID/MegaCli/MegaCli64 -AdpSetProp -EnableJBOD -0 -aALL
Adapter 0: Set JBOD to Disable success.

Exit Code: 0x00
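
After a swap, a quick way to spot drives that still need attention is to list every slot whose Firmware state is not Online, which covers both the JBOD case here and the Unconfigured(good) case from the earlier replacement. A minimal sketch, assuming the listing format shown above; pd_not_online is a hypothetical helper name.

```shell
#!/bin/sh
# pd_not_online (hypothetical helper): read MegaCli physical-drive listing
# text on stdin and print "<slot> <state>" for each drive whose Firmware
# state does not start with "Online".
pd_not_online() {
    awk -F': *' '
        /^Slot Number/    { slot = $2 }
        /^Firmware state/ { if ($2 !~ /^Online/) print slot, $2 }
    '
}

# Usage:
#   /opt/MegaRAID/MegaCli/MegaCli64 -pdlist -aAll | pd_not_online
```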

More to come…