UFS fsck: Floating point exception?

Reaperzx

Active Member

Reaction score: 3
Messages: 105

This morning my home server decided to restart for some reason. After coming up in single user mode, I was able to fsck the other file systems, but not the data disk:

Code:
pid 64 (fsck_ufs), jid 0, uid 0: exited on signal 8
fsck: /dev/da0p1: Floating point exception

da0p1 is a 33TB RAID6 array (8x6TB). I checked the Areca RAID BIOS; all disks seem fine and are in the array.

Is my old 3570 CPU starting to fail? I did plan to replace it with 9700K from my desktop this summer...

I would appreciate advice on how to proceed. Restarting the system has no effect; the result is the same.

If worst comes to worst, I still have my monthly backups... but I would like to try without them first.

(HW probe link removed for privacy)
 

Reaperzx

Active Member

Reaction score: 3
Messages: 105

Not possible in this FreeBSD server, because this is hardware RAID. But I will remove the disks from the server one-by-one and test them in my Windows desktop. SMART info should still be the same.
 

SirDice

Administrator
Staff member
Moderator

Reaction score: 11,202
Messages: 37,354

Not possible in this FreeBSD server, because this is hardware RAID
Hardware RAID isn't going to protect you from single block failures. It's only helpful if a whole drive dies. Read errors are still going to happen. That's why ZFS for example has the ability to actually fix errors like that. Hardware RAID or other software RAID solutions only have error-detection, no error-correction. It can only restore data if an entire disk fails.

But I will remove the disks from the server one-by-one and test them in my Windows desktop. SMART info should still be the same.
No need to remove them. Many RAID controllers allow you to query an individual disk's SMART info through the controller.

Code:
areca,N - [FreeBSD, Linux, Windows and Cygwin only] the device
consists of one or more SATA disks connected to an Areca SATA
RAID controller. The positive integer N (in the range from 1 to
24 inclusive) denotes which disk on the controller is monitored.
On FreeBSD use syntax such as:
    smartctl -a -d areca,2 /dev/arcmsr1
    smartctl -a -d areca,3 /dev/arcmsr2
The first line above addresses the second disk on the first
Areca RAID controller. The second line addresses the third disk
on the second Areca RAID controller.

Important: the Areca controller must have firmware version 1.46
or later. Lower-numbered firmware versions will give (harmless)
SCSI error messages and no SMART information.

areca,N/E - [FreeBSD, Linux, Windows and Cygwin only] the device
consists of one or more SATA or SAS disks connected to an Areca
SAS RAID controller. The integer N (range 1 to 128) denotes the
channel (slot) and E (range 1 to 8) denotes the enclosure.
Important: This requires Areca SAS controller firmware version
1.51 or later.
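
As a rough sketch of how the syntax above could be applied to every disk in an 8-disk array, the loop below prints one smartctl query per slot. The controller node (taken from the man page example; check `ls /dev/arcmsr*` for the real one), the slot count, and the helper name `areca_smart_all` are assumptions, not details confirmed in this thread. It only prints the commands unless RUN is set to an empty value.

```shell
#!/bin/sh
# Dry-run sketch: one smartctl query per Areca slot.
# Assumptions: controller node and slot count below are placeholders.
CTRL="${CTRL:-/dev/arcmsr1}"   # node as written in the man page excerpt
SLOTS="${SLOTS:-8}"            # the array in this thread has 8 disks
RUN="${RUN-echo}"              # prints by default; start with RUN= to execute

areca_smart_all() {
    n=1
    while [ "$n" -le "$SLOTS" ]; do
        # -a asks for all SMART information of the disk in slot n
        $RUN smartctl -a -d "areca,$n" "$CTRL"
        n=$((n + 1))
    done
}

areca_smart_all
```

Printing first makes it easy to confirm the slot numbers and device node before anything is sent to the controller.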
 

Alexey V. Gubin

New Member

Reaction score: 5
Messages: 9

Hardware RAID isn't going to protect you from single block failures. It's only helpful if a whole drive dies. Read errors are still going to happen. That's why ZFS for example has the ability to actually fix errors like that. Hardware RAID or other software RAID solutions only have error-detection, no error-correction. It can only restore data if an entire disk fails.

Is that so? In all hardware RAID controllers I have seen, if one of the disks develops a single bad block that happens to hold data (not parity), the data is reconstructed from parity on read.

If the disk returns incorrect data, that's a different story, because the controller trusts the drive to detect that the ECC does not match the sector. But if it is a plain read error, the controller will do the recovery on the fly.
 

Reaperzx

Active Member

Reaction score: 3
Messages: 105

Well, I found one malfunctioning disk: a 3-year-old Toshiba X300. No bad sectors or reallocations, but read access sometimes takes 3+ seconds.

https://pasteboard.co/K35pUVO.png

I am not sure this will solve my fsck problem, but I have to replace the hard disk regardless. It pains me that I have to buy another HDD when I actually want to move to SSDs. But SSDs over 4TB are very expensive. Although even hard drives seem to have become more expensive recently...
 

fcorbelli

Active Member

Reaction score: 51
Messages: 162

It's not always that easy; parity may not even exist at all (e.g. in a mirror). In that case a HW check has no way to know whether the data on disk 1 or on disk 2 is valid when a sector is damaged.

Back to the question: why use HW RAID with FreeBSD?
Why keep fighting with fsck?

And lastly: a 4TB WD SSD (wds400t1r0a) costs about 500 euro.
 

Reaperzx

Active Member

Reaction score: 3
Messages: 105

Yeah, I know the prices. I could even get a SAMSUNG 870 QVO 4TB for 350 EUR, including VAT. But that would mean decreasing the array size.

Anyway, that is beside the point.

EDIT: SAMSUNG 870 QVO 8TB goes for 750 EUR. That's not cheap, but in a couple of years, maybe...
 

covacat

Well-Known Member

Reaction score: 133
Messages: 274

try fsck -f or kill the journal before fsck
the single floating point op in the src is related to some journal stats
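
A minimal sketch of that suggestion, assuming "the journal" means soft-updates journaling (SU+J) and that the file system is unmounted or mounted read-only, as it would be in single-user mode. The device name comes from the first post; the helper name `check_without_journal` is made up for illustration. By default it only prints the commands.

```shell
#!/bin/sh
# Dry-run sketch: disable soft-updates journaling, then force a full check.
# Assumption: the journal in question is SU+J (tunefs -j), not gjournal (tunefs -J).
DEV="${DEV:-/dev/da0p1}"   # device from the first post
RUN="${RUN-echo}"          # prints by default; start with RUN= to execute

check_without_journal() {
    $RUN tunefs -p "$DEV"            # print current settings, including journaling
    $RUN tunefs -j disable "$DEV"    # turn soft-updates journaling off
    $RUN fsck_ufs -f -y "$DEV"       # forced full check, answering yes to every question
}

check_without_journal
```

Journaling can be turned back on later with `tunefs -j enable` once the file system checks clean.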
 

Reaperzx

Active Member

Reaction score: 3
Messages: 105

I did start with fsck -f -y, same result.

I will have to investigate the journal; I have never touched it.
 

Reaperzx

Active Member

Reaction score: 3
Messages: 105

Well, yes, might as well test the RAM now that the server is not working...
 

fcorbelli

Active Member

Reaction score: 51
Messages: 162

Well, yes, might as well test the RAM now that the server is not working...
You may have a RAM problem; it would be unusual, but it can't be ruled out.
From your log it does not seem to be ECC, just normal DDR.
I would do a couple of rounds of tests, just to be sure...

For speed, during testing and rebuilding, I would leave only one DIMM installed.
8GB should be more than enough.
 

Reaperzx

Active Member

Reaction score: 3
Messages: 105

Erm, no. It's pretty tight fit. If I am going to dismantle it, I am sure I might as well replace the motherboard+RAM+CPU:

https://pasteboard.co/HMPVsHo.jpg

Only disks are hot swap.

Will leave memtest over night. Let's see what it shows. Although I have very rarely seen RAM go bad.
 

fcorbelli

Active Member

Reaction score: 51
Messages: 162

Erm, no. It's pretty tight fit. If I am going to dismantle it, I am sure I might as well replace the motherboard+RAM+CPU:

https://pasteboard.co/HMPVsHo.jpg

Only disks are hot swap.

Will leave memtest over night. Let's see what it shows. Although I have very rarely seen RAM go bad.
From the picture you can take out 3 DIMMs without much effort, cutting the test time to 1/4.
PS OT: why don't you put a small fan right on the controller chip? It's the first thing I do with a separate controller.
 

Reaperzx

Active Member

Reaction score: 3
Messages: 105

OT: There is a black fan next to the blue Zalman aftermarket heatsink on the RAID controller. Both are much bigger than the originals. The original fan died.
OT: Yes, the 10G network card does not have a fan. But it is cool enough in the cellar. And I have spares.
OT: In my Windows desktop I have already lost a couple of 10G cards. But a fan would add more noise, which matters in a desktop...
OT: I guess I should get a bigger case for the next server build...
 

mtu

Active Member

Reaction score: 65
Messages: 117

OT: I have that exact same CPU cooler in my gaming box. I remember spending days researching it, because it installs both ways (vertical and horizontal air flow) on Intel CPUs, but only one way (horizontal) on AMD, which was a headache to figure out.
 

Reaperzx

Active Member

Reaction score: 3
Messages: 105

Well, now the system is up. No memtest errors.

It seems that after removing the faulty disk there is no more floating point error. No logic, but that's how it is.

I had to run fsck 3 times; the first two runs gave "file system still dirty" "please rerun fsck".

At first glance files seem to be OK, but of course there could be some data corruption somewhere.

Now running RAID in degraded mode, waiting for replacement hard disk to arrive.
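
The "rerun until clean" routine described above could be wrapped in a small loop. This is only a sketch: the helper name `run_until_clean` is hypothetical, and it assumes that a non-zero exit status from the check means the file system is not yet clean.

```shell
#!/bin/sh
# Sketch: repeat a check command until it exits 0, up to a maximum number of passes.
# Assumption: non-zero exit status means "not clean yet, run again".
run_until_clean() {
    max=$1
    shift
    pass=1
    while [ "$pass" -le "$max" ]; do
        echo "pass $pass: $*"
        if "$@"; then
            return 0          # the check succeeded, stop here
        fi
        pass=$((pass + 1))
    done
    return 1                  # still failing after the last allowed pass
}

# Example (commented out; device name from the first post):
# run_until_clean 5 fsck_ufs -f -y /dev/da0p1
```

Capping the number of passes avoids looping forever if the check keeps failing for a reason that reruns cannot fix.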
 

Reaperzx

Active Member

Reaction score: 3
Messages: 105

Why do I use FreeBSD? Out of habit, I guess, some 15 years already. It usually works fine, if I don't mess with it. Most failures have been from hardware (disks, PSU, mainboard).

Why do I use UFS? I guess no skills for ZFS.

Why do I use hardware RAID? Out of habit too, I guess.

Although I recently set up offsite backup server with 20 older hard disks 1-3TB size. Graid5 configuration. Let's say it was not a trivial thing either, took me several months of my free time to set it up:
https://bsd-hardware.info/?probe=63db0c4563
https://pasteboard.co/K3o5Yrd.png
https://forums.freebsd.org/threads/newfs-wtfs-invalid-argument.77897/
https://forums.freebsd.org/threads/aquantia-5gigabit-nic-driver-status-aqc108.77821/
 

ralphbsz

Son of Beastie

Reaction score: 2,097
Messages: 3,061

Why do I use UFS? I guess no skills for ZFS.
While the rest of your post is very reasonable, you should question this assumption. Learning how to set up and use ZFS is not very hard (a half dozen concepts, a half dozen commands, if you want to just replace UFS with something more reliable). And it's a good investment in data durability. Once you know how to do it, you can take all the graid and hardware-specific stuff and forget it, so your life will become easier in the long run.
 

richardtoohey2

Aspiring Daemon

Reaction score: 266
Messages: 524

Why do I use FreeBSD? Out of habit, I guess, some 15 years already. It usually works fine, if I don't mess with it. Most failures have been from hardware (disks, PSU, mainboard).

Why do I use UFS? I guess no skills for ZFS.

Why do I use hardware RAID? Out of habit too, I guess.
I'd answer the same to all those - you are not alone.

I've looked at ZFS and there's a lot to learn. There are a lot of extra benefits to ZFS, too, I get that - but lots to learn, especially to get the best performance for different workloads (admittedly stuff I've not learned too much about before using hardware RAID and UFS!)

And there seem to be quite a few off-putting stories on these forums about people having issues with ZFS, ARC & RAM, etc. And yes, there are horror stories about hardware RAID and UFS, too. But better the devil you know.
 

fcorbelli

Active Member

Reaction score: 51
Messages: 162

I am primarily a data storage manager, from CP/M to OS/2 to Solaris to now.
I do not really see a single reason NOT to use zfs.
The PC becomes a 'giant raid controller' with hundreds of GB and tens of cores.
For me it is the easiest filesystem to manage.
You can make a 'raid volume' with 3 or 4 commands, that's it.

Nothing else to do (only one exception: limit the arc size, TWO edits in a config text file).

Done.

No configuration, no optimization needed.

More advanced functions? Ask on the forum!

Short version: FreeBSD with zfs will become your 'dream', 90% of everything you ever want is just here.

I always think that UFS is like a baseball bat, light and reliable.
But zfs is like an ICBM.
Some gotchas, but definitely more powerful than a baseball bat.

PS there is no fsck with zfs (without deduplication).
I repeat.
There is no fsck at all.
No, there is not a 'hidden' fsck or whatever.
That alone is enough for a data manager to throw away everything else.
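
For readers wondering what "3 or 4 commands" might look like, here is a hedged dry-run sketch. The pool name, dataset name, disk list, and the helper name `zfs_setup` are all hypothetical, and it assumes the controller exposes the disks individually (JBOD or pass-through) instead of as one RAID volume. It prints the commands unless RUN is set to an empty value.

```shell
#!/bin/sh
# Dry-run sketch of a small ZFS setup. Every name below is a placeholder.
POOL="${POOL:-tank}"
DISKS="${DISKS:-da0 da1 da2 da3 da4 da5 da6 da7}"   # assumes 8 individually visible disks
RUN="${RUN-echo}"   # prints by default; start with RUN= to execute

zfs_setup() {
    # $DISKS is unquoted on purpose so it splits into one argument per disk
    $RUN zpool create "$POOL" raidz2 $DISKS      # double parity, comparable to RAID6
    $RUN zfs create "$POOL/data"                 # a dataset to hold the files
    $RUN zfs set compression=lz4 "$POOL/data"
    $RUN zpool scrub "$POOL"                     # online checksum verification
}

zfs_setup

# One possible ARC limit, as a line in /boot/loader.conf (example value, 8 GiB in bytes):
# vfs.zfs.arc_max="8589934592"
```

The scrub runs while the pool stays online, which is the closest ZFS counterpart to the offline fsck passes discussed earlier in the thread.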
 

grahamperrin

Aspiring Daemon

Reaction score: 189
Messages: 643

fsck: /dev/da0p1: Floating point exception

… I recently set up offsite backup server …


To clarify: is the probe from the machine where you get the exception with fsck? Or from a server elsewhere?

(Underlying question: with which version of FreeBSD do you get the exception?)
 