File system still dirty

Hi there,
one of our FreeBSD systems fails to boot with the following errors:

Code:
(da0:mps0:0:1:0): READ(10). CDB: 28 0 52 47 5f df 0 0 80 0
(da0:mps0:0:1:0): CAM status: SCSI Status Error
(da0:mps0:0:1:0): SCSI status: Check Condition
(da0:mps0:0:1:0): SCSI sense: MEDIUM ERROR info?:c0010000 asc:0;0 (No additional sense information)

/dev/da0s1f: CANNOT READ BLK: 1300717472
/dev/da0s1f: UNEXPECTED SOFT UPDATE INCONSISTENCY: RUN fsck MANUALLY.

After rebooting the system in single-user mode, we launched:

Code:
fsck -y /dev/da0s1f

The result was:

Code:
(da0:mps0:0:1:0): READ(10). CDB: 28 00 52 47 60 31 00 00 01 00
(da0:mps0:0:1:0): CAM status: SCSI Status Error
(da0:mps0:0:1:0): SCSI status: Check Condition
(da0:mps0:0:1:0): SCSI sense: MEDIUM ERROR asc:ffffffff,ffffffff (Reserved ASC/ASCQ pair)
(da0:mps0:0:1:0): Retrying command (per sense data)
(da0:mps0:0:1:0): Error 5, Retries exhausted

THE FOLLOWING DISK SECTORS COULD NOT BE READ: 1300717554

FILE SYSTEM STILL DIRTY
PLEASE RERUN FSCK

And so did we, with no luck.

The slice /dev/da0s1f (/usr) is on a "Dell Virtual Disk 1028" (2 disks in RAID1) behind a Dell SAS2008 RAID controller.
Please note that the RAID status is reported as "optimal" in the SAS utility.

We called the on-site IT guy and had him pull out one drive, then we booted the system in single-user mode and ran fsck again, with no luck. Then I had him pull out that drive and insert the other one: same story (and the same sectors).

We also cold-booted the system and ran fsck several times, including from a live CD, always without luck.
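One way to double-check the reported sector independently of fsck is to read it directly with dd. A sketch, assuming the device name and sector number from the logs above and a 512-byte sector size; the offset arithmetic runs anywhere, but the dd probe itself only makes sense on the affected machine:

```shell
# Sector reported unreadable by fsck (from the console messages above).
SECTOR=1300717554
# Byte offset within the slice, assuming 512-byte sectors.
OFFSET=$((SECTOR * 512))
echo "byte offset: $OFFSET"

# On the affected machine (not runnable elsewhere), probe just that sector:
#   dd if=/dev/da0s1f of=/dev/null bs=512 skip=$SECTOR count=1
# A clean exit means the sector reads fine; an "Input/output error" confirms
# the medium error persists on whichever half of the mirror is active.
```

Probing the same sector number after swapping drives would also tell you whether both halves of the mirror really fail at identical offsets.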

Can someone please point us in the right direction to get the server running again?

Thanks in advance
 
Can someone please point us in the right direction to get the server running again?

The classic solution in these cases, and the only one guaranteed to work, is to fix or replace the failing disk subsystem hardware, then restore from a recent backup.
 
The classic solution in these cases, and the only one guaranteed to work, is to fix or replace the failing disk subsystem hardware, then restore from a recent backup.

That would be the next step (they have already ordered a new server for that purpose).
I'm just looking for tips on additional steps to investigate the failure, to better understand how this RAID1 failed on both disks (if it did) and how FreeBSD copes with that.
 
I'm just looking for tips on additional steps to investigate the failure, to better understand how this RAID1 failed on both disks (if it did) and how FreeBSD copes with that.

I believe this is what Uniballer was getting at, but just in case: in my experience, "CAM Status" errors, followed by "Error 5, retries exhausted," point to lower-level hardware failures. Not disk failures, but a failure of the underlying hardware to interact with the disk firmware, or an inability of FreeBSD to interact with the peripheral's firmware. That is to say that it may be your RAID controller that failed, not the disk. FreeBSD copes with that as well as any other operating system (it doesn't).
 
First of all, I'd like to thank both of you for taking the time to read (and reply to) my post.

Then an update: the on-site IT staff just replaced the RAID controller. I started the system with a degraded RAID (just one disk) in single-user mode, then tried to fsck the disk.
Unfortunately fsck is still unable to read some sectors, and the result is always "FILE SYSTEM STILL DIRTY - PLEASE RERUN FSCK".

My next attempt would be to have the on-site IT guy pull this disk and insert the other one, then repeat the fsck in single-user mode, hoping that the other disk is undamaged.
Do you people have any other suggestions?
 
I've had this and solved it by running fsck without "-y". You then have to work your way through all the questions, but in my case the filesystem actually came clean, whereas with "-y" included, it didn't.
 
I've had this and solved it by running fsck without "-y".

Better yet, run fsck -n first. With -n, fsck answers "no" to every question and writes nothing to the filesystem, so it will tell you what problems exist, and how many, without actually doing anything. That will give you an idea of just how messed up things might be. As for going over each issue one by one, I've never had an instance where a single problem trips up the entire fsck myself, but I've heard of such instances.
 
Hi there,
one of our FreeBSD systems fails to boot with the following errors:

Code:
(da0:mps0:0:1:0): READ(10). CDB: 28 0 52 47 5f df 0 0 80 0
(da0:mps0:0:1:0): CAM status: SCSI Status Error
(da0:mps0:0:1:0): SCSI status: Check Condition
(da0:mps0:0:1:0): SCSI sense: MEDIUM ERROR info?:c0010000 asc:0;0 (No additional sense information)
As other replies have stated, FreeBSD thinks that either the controller or the drive(s) are broken. fsck(8) is designed to find and fix logical inconsistencies, not physical ones. At the time fsck(8) came along to replace a bunch of manual utilities like clri(8), your disk drive (the size of a washing machine) was probably broken, and you'd call the repair number on the side of the computer, and someone from the company you paid big $ every month would come out and fix it. Or break it if it was actually working properly.

Anyway, when fsck(8) can't read one or more blocks due to a hardware error (assuming it isn't something that is stored redundantly) it just goes "I give up. Want to try again?" which is a huge improvement on older implementations, which could cause a "death spiral" of more and more errors each time you ran fsck(8) until things degraded to the point where it would say "Filesystem? What filesystem?"

In any event, the first step is the same as in medicine: "First, do no harm". Which means not trying anything that could lead to further potential data loss until all other avenues have been exhausted.
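In practice, "do no harm" usually means imaging the surviving disk before letting fsck write anything. A hedged sketch: the device and target paths below are examples, not taken from this system, and recoverdisk(8) is FreeBSD's purpose-built tool for salvaging from failing media. The short demo at the end only illustrates how conv=sync pads short reads to full blocks, so it can run on any machine:

```shell
# On the affected machine (paths are examples, adjust to your setup):
#   dd if=/dev/da0 of=/backup/da0.img bs=64k conv=noerror,sync
#   recoverdisk /dev/da0 /backup/da0.img    # retries bad regions at finer grain
# conv=noerror keeps dd going past read errors; conv=sync pads each short
# block, so the good data stays at the right offsets in the image.

# Local demo of the conv=sync padding behavior, on a scratch file:
printf '%01000d' 0 > /tmp/src.bin                  # 1000-byte input
dd if=/tmp/src.bin of=/tmp/dst.bin bs=512 conv=noerror,sync 2>/dev/null
wc -c < /tmp/dst.bin                               # 1024: padded to whole 512-byte blocks
```

With an image in hand, fsck can be run against a copy of the image as many times as needed without risking the original media.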
The slice /dev/da0s1f (/usr) is on a "Dell Virtual Disk 1028" (2 disks in RAID1) behind a Dell SAS2008 RAID controller.
Please note that the RAID status is reported as "optimal" in the SAS utility.

We called the on-site IT guy and had him pull out one drive, then we booted the system in single-user mode and ran fsck again, with no luck. Then I had him pull out that drive and insert the other one: same story (and the same sectors).
The status will remain "optimal" until the controller does a scheduled "patrol read" and discovers the issue for itself, or you manually ask it to verify the volume (normally in the Control-R BIOS menu). Back in the "old days", RAID controllers had tricks like writing sectors with the complement of the checksum to indicate "You can read this data without error, but I (the controller) have no idea what data should be here, so I'm letting you know."
 