gjournal read error and spontaneous reboot on file removal attempt

Hi --

Yesterday morning I noticed an application crashing in one of my service jails (7.2-RELEASE, dedicated partition) due to:

Code:
Jan 14 00:18:54 <kern.crit> xyz kernel: g_vfs_done():mirror/gm0s1g.journal[READ(offset=3625792423936, length=4096)]error = 5
Jan 14 00:18:54 <kern.crit> xyz kernel: vnode_pager_getpages: I/O read error
Jan 14 00:18:54 <kern.crit> xyz kernel: vm_fault: pager read error, pid 1273 (application)

This is a gmirror RAID1 setup with gjournal on top. smartmontools indicates that both disks are in good shape.
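
In case it helps others, these are roughly the kind of checks I mean; the disk device names below are only examples, substitute whatever gmirror status lists on your system:

Code:
# Check that both mirror components are active and synchronized
gmirror status gm0

# Query SMART health for each underlying disk
# (ad4/ad6 are placeholders for the components gmirror reports)
smartctl -H -a /dev/ad4
smartctl -H -a /dev/ad6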

I could find a file throwing this error when attempting to read its content (dd, sum). So I tried to remove it, but, to my surprise, the machine rebooted spontaneously without leaving any hint in any of the system logs. This is reproducible :-(
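
For reference, a crude way to hunt for further unreadable files is a scan like the one below; the mount point /jails/xyz is just a placeholder for the affected partition, and it only reads, it does not try to remove anything:

Code:
# Read every regular file once; print the names of files that
# return an I/O error (dd exits non-zero on a read error)
find /jails/xyz -type f -exec sh -c \
    'dd if="$1" of=/dev/null bs=64k >/dev/null 2>&1 || echo "READ ERROR: $1"' sh {} \;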

Well, now I'd like to recover from those read errors on that partition without further spontaneous reboots of my production server.

I'm pretty new to FreeBSD, so I'd appreciate feedback on how best to achieve that:

1. Should I reformat that partition and start from scratch (a recent backup is available)?
2. Might a dd if=/dev/mirror/gm0s1g.journal be helpful for tracking and remapping bad blocks?
3. Does one need to remove the journal first?
4. Or just unmount and fsck that partition? (A rough sketch of what I mean by 2. and 4. follows below.)
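
To make 2. and 4. a bit more concrete, this is the kind of sequence I have in mind; I'm not sure it is the right procedure for a gjournaled filesystem (hence question 3.), and the mount point is only a placeholder:

Code:
# Take the filesystem offline before touching it
umount /jails/xyz                     # placeholder mount point

# Surface-scan the journaled provider; with conv=noerror dd keeps
# going past read errors instead of stopping (nothing is written)
dd if=/dev/mirror/gm0s1g.journal of=/dev/null bs=1m conv=noerror

# Check the filesystem living on the journaled provider
fsck -t ufs /dev/mirror/gm0s1g.journal

# Remount afterwards
mount /dev/mirror/gm0s1g.journal /jails/xyz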

Thanks for your feedback in advance!
 
My company supports numerous systems running gjournal on top of SATA or SAS drives or hardware RAID arrays. We've seen a single system exhibit the same behavior that you describe... it was a system with HP Smart RAID, running 8.0-RELEASE.

The system was a lab box, so we just reformatted it and started over... but the experience definitely is a bit worrisome. I would love to have a better understanding of what to do in this situation -- in terms of both root-cause analysis and recovery/repair options. Most importantly, does this phenomenon point to a correctable weakness in some layer of the storage system?

What did you end up doing?
 
charles said:
What did you end up doing?

Sorry for my not-so-timely response ;-)

First, I migrated to FreeBSD 8.0, to no avail. Throughout all those months I kept having spontaneous reboots without any hints in my log files, which was strange.

Only by chance did I identify the cause of those reboots: an application running in that jail used Berkeley DB 4.7 functionality. Database maintenance, triggered by a cron job, caused a database index file to be copied to a new instance. Once in a while, the final removal of the old index file immediately ended in a reboot. I could reproduce that behavior by removing the index file myself (approximately 1 reboot out of 10 removal attempts).
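
If anyone wants to reproduce something similar, a crude test along the lines below should approximate that workload; the paths are only examples, and each attempt is logged and synced to another filesystem so the last attempt number survives the reboot:

Code:
# Repeatedly simulate the maintenance job: copy the suspect index
# file to a new instance, then remove the copy again.
i=1
while [ "$i" -le 10 ]; do
    cp /jails/xyz/db/index.db /jails/xyz/db/index.db.$i      # placeholder paths
    echo "attempt $i: removing copy" >> /var/log/bdb_remove_test.log
    sync                                   # flush the log before the risky step
    rm /jails/xyz/db/index.db.$i           # the removal is what triggered the reboots
    i=$((i + 1))
done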

I am still completely clueless about what the actual cause might have been. But after replacing that Berkeley DB functionality with a completely different one (6 months ago), I haven't had a single spontaneous reboot since.
 