UFS fsck: Floating point exception?

Reaperzx · May 22, 2021

This morning my home server decided to restart for some reason. After it came up in single-user mode, I was able to fsck the other file systems, but not the data disk:

Code:

pid 64 (fsck_ufs), jid 0, uid 0: exited on signal 8
fsck: /dev/da0p1: Floating point exception

da0p1 is 33TB RAID6 disk (8x6TB). I did check Areca RAID BIOS, all disks seem fine and in array.

Is my old 3570 CPU starting to fail? I did plan to replace it with 9700K from my desktop this summer...

I would appreciate advice on how to proceed. Restarting the system has no effect; the result is the same.

If worst comes to worst I still have my monthly backups... But I would like to try without them first.

(HW probe link removed for privacy)

SirDice · May 22, 2021

Reaperzx said:
da0p1 is 33TB RAID6 disk (8x6TB). I did check Areca RAID BIOS, all disks seem fine and in array.

One or more of your disks may have bad blocks. Check the SMART data of each disk individually with smartctl(8).

Reaperzx · May 22, 2021

Not possible in this FreeBSD server, because this is hardware RAID. But I will remove the disks from the server one-by-one and test them in my Windows desktop. SMART info should still be the same.

SirDice · May 22, 2021

Reaperzx said:
Not possible in this FreeBSD server, because this is hardware RAID

Hardware RAID isn't going to protect you from single block failures. It's only helpful if a whole drive dies. Read errors are still going to happen. That's why ZFS for example has the ability to actually fix errors like that. Hardware RAID or other software RAID solutions only have error-detection, no error-correction. It can only restore data if an entire disk fails.

Reaperzx said:
But I will remove the disks from the server one-by-one and test them in my Windows desktop. SMART info should still be the same.

No need to remove them. Many RAID controllers allow you to query an individual disk's SMART info through the controller.

Code:

areca,N - [FreeBSD, Linux, Windows and Cygwin only] the device
        consists of one or more SATA disks connected to an Areca SATA
        RAID controller. The positive integer N (in the range from 1 to
        24 inclusive) denotes which disk on the controller is monitored.
        On FreeBSD use syntax such as:
        smartctl -a -d areca,2 /dev/arcmsr1
        smartctl -a -d areca,3 /dev/arcmsr2
        The first line above addresses the second disk on the first
        Areca RAID controller. The second line addresses the third disk
        on the second Areca RAID controller.

        Important: the Areca controller must have firmware version 1.46
        or later. Lower-numbered firmware versions will give (harmless)
        SCSI error messages and no SMART information.

areca,N/E - [FreeBSD, Linux, Windows and Cygwin only] the device
        consists of one or more SATA or SAS disks connected to an Areca
        SAS RAID controller. The integer N (range 1 to 128) denotes the
        channel (slot) and E (range 1 to 8) denotes the enclosure.
        Important: This requires Areca SAS controller firmware version
        1.51 or later.
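
To check all eight disks of the array in one go, the per-disk syntax from the excerpt above can be generated in a loop. This is only a rough sketch: the function name areca_smart_cmds is made up for illustration, and the controller node /dev/arcmsr0 is an assumption (check which arcmsr nodes exist under /dev on the actual machine). It prints the commands instead of running them.

```shell
#!/bin/sh
# Print (not run) one smartctl command per disk slot on an Areca SATA controller.
# Usage: areca_smart_cmds DEVICE COUNT
#   DEVICE: controller node; /dev/arcmsr0 below is an assumption, verify it locally
#   COUNT:  number of disk slots to query (8 for the array in this thread)
areca_smart_cmds() {
    dev=$1
    count=$2
    n=1
    while [ "$n" -le "$count" ]; do
        echo "smartctl -a -d areca,$n $dev"
        n=$((n + 1))
    done
}

# Dry run: list the commands for 8 slots.
areca_smart_cmds /dev/arcmsr0 8
```

Once the printed commands look right, piping the output to sh as root would execute them one after another.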

Alexey V. Gubin · May 22, 2021

SirDice said:
Hardware RAID isn't going to protect you from single block failures. It's only helpful if a whole drive dies. Read errors are still going to happen. That's why ZFS for example has the ability to actually fix errors like that. Hardware RAID or other software RAID solutions only have error-detection, no error-correction. It can only restore data if an entire disk fails.

Is that so? In all hardware RAID controllers I have seen, if one of the disks develops a single bad block, which happens to be data (not parity), the data will be reconstructed when reading using parity.

If the disk provides incorrect data, that's a different story, because the controller trusts the drive to figure out ECC does not match in the sector, but if it is a plain read error, the controller will do the recovery on the fly.
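
The on-the-fly recovery described here can be illustrated with a toy stripe. This is a simplified sketch using plain XOR parity (RAID5 style) on one-byte "blocks"; the RAID6 array in this thread keeps a second, differently computed parity block as well, and the byte values are arbitrary.

```shell
#!/bin/sh
# Toy single-parity stripe: three one-byte data blocks plus their XOR parity.
d1=$((0x5A)); d2=$((0x3C)); d3=$((0xF0))
parity=$((d1 ^ d2 ^ d3))

# Suppose the drive holding d2 reports a plain read error for this sector.
# XOR of the surviving data blocks and the parity block gives d2 back.
rebuilt=$((d1 ^ d3 ^ parity))
printf 'original d2=0x%02X rebuilt d2=0x%02X\n' "$d2" "$rebuilt"
```

The rebuild only works because the failing drive says which block is unreadable. If a drive silently returns wrong data, this scheme has no way to tell which block to distrust, which is the "different story" mentioned above.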

Reaperzx · May 22, 2021

Well, I found one malfunctioning disk: a 3-year-old Toshiba X300. No bad or reallocated sectors, but read access times are sometimes 3+ seconds.

https://pasteboard.co/K35pUVO.png

I am not sure this will solve my fsck problem, but I have to replace the hard disk regardless. It pains me that I have to buy another HDD when I actually want to move to SSDs. But SSDs over 4TB are very expensive. Although even hard drives seem to have got more expensive recently...

Deleted member 67440 · May 22, 2021

It's not always that easy; parity may not even exist at all (e.g. in a mirror). In that case a HW check has no way to know whether the data on disk 1 or on disk 2 is valid when a sector is damaged.

Back to the question: why use HW RAID with FreeBSD?
Why keep fighting with fsck?

And lastly: a 4TB WD SSD (wds400t1r0a) costs about 500 euro each.

Reaperzx · May 22, 2021

Yea, I know the prices. Could get SAMSUNG 870 QVO 4TB even for 350 EUR, including VAT. But that would mean decreasing array size.

Anyway, that is beside the point.

EDIT: SAMSUNG 870 QVO 8TB goes for 750 EUR. That's not cheap, but in a couple of years, maybe...

covacat · May 22, 2021

try fsck -f or kill the journal before fsck
the single floating point op in the src is related to some journal stats
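
For reference, "killing the journal" might look roughly like the steps below. This is a sketch under assumptions: it supposes the file system uses soft updates journaling (SU+J), which the thread does not confirm, and the function name journal_off_plan is made up. A gjournal setup would be handled differently (tunefs has a separate -J flag for it). The commands are printed, not executed, and the file system must be unmounted before running them for real.

```shell
#!/bin/sh
# Print (not run) the steps for retrying fsck with soft updates journaling off.
# DEV is the unmounted UFS partition; /dev/da0p1 is the one from this thread.
journal_off_plan() {
    dev=$1
    echo "dumpfs $dev | grep -i flags"   # inspect the flags to see if journaling is on
    echo "tunefs -j disable $dev"        # turn soft updates journaling off
    echo "fsck_ufs -f -y $dev"           # forced full check, answering yes
}

journal_off_plan /dev/da0p1
```

If the check then completes, journaling can be switched back on later with tunefs -j enable.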

Reaperzx · May 22, 2021

I did start with fsck -f -y, same result.

Will have to investigate journal, never touched it.

Deleted member 67440 · May 22, 2021

Reaperzx said:
Yea, I know the prices. Could get SAMSUNG 870 QVO 4TB even for 350 EUR, including VAT. But that would mean decreasing array size.

Anyway, that is beside the point.

QVOs are NOT a smart choice.
Turning back to the question: have you already tested the RAM with https://www.memtest.org/?

Reaperzx · May 22, 2021

Well, yes, might as well test the RAM now that the server is not working...

Deleted member 67440 · May 22, 2021

Reaperzx said:
Well, yes, might as well test the RAM now that the server is not working...

You may have a RAM problem; it would be unusual, but it cannot be ruled out.
From your log it does not seem to be ECC, just normal DDR.
I would do a couple of rounds of tests, just to be sure...

For speed, during testing and rebuilding, I would leave only one DIMM installed.
8GB should be more than enough.

Reaperzx · May 22, 2021

Erm, no. It's pretty tight fit. If I am going to dismantle it, I am sure I might as well replace the motherboard+RAM+CPU:

https://pasteboard.co/HMPVsHo.jpg

Only disks are hot swap.

Will leave memtest over night. Let's see what it shows. Although I have very rarely seen RAM go bad.

Deleted member 67440 · May 22, 2021

Reaperzx said:
Erm, no. It's pretty tight fit. If I am going to dismantle it, I am sure I might as well replace the motherboard+RAM+CPU:

https://pasteboard.co/HMPVsHo.jpg

Only disks are hot swap.

Will leave memtest over night. Let's see what it shows. Although I have very rarely seen RAM go bad.

From the picture you can take out 3 DIMMs without much effort, cutting the test time to 1/4.
PS OT: why don't you put a small fan right on the controller chip? It's the first thing I do with a separate controller.

Reaperzx · May 22, 2021

OT: There is a black fan next to the blue Zalman aftermarket heatsink on the RAID controller. Both are much bigger than the originals. The original fan died.
OT: Yes, the 10G network card does not have a fan. But it is cool enough in the cellar. And I have spares.
OT: In my Windows desktop I have already lost a couple of 10G cards. But a fan would add more noise, which matters in a desktop...
OT: I guess I should get a bigger case for next server build...

mtu · May 23, 2021

OT: I have that exact same CPU cooler in my gaming box. I remember spending days researching it, because it installs both ways (vertical and horizontal air flow) on Intel CPUs, but only one way (horizontal) on AMD, which was a headache to figure out.

Reaperzx · May 24, 2021

Well, now the system is up. No memtest errors.

It seems that after removing the faulty disk there is no more floating point error. No logic to it, but that's how it is.

Had to run fsck 3 times; the first 2 times it gave "file system still dirty" and "please rerun fsck".

At first glance files seem to be OK, but of course there could be some data corruption somewhere.

Now running RAID in degraded mode, waiting for replacement hard disk to arrive.
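
The repeated fsck runs mentioned above could be wrapped in a small retry helper. This is only a sketch: the function name rerun_until_clean is made up, and it assumes that fsck_ufs exits with a non-zero status whenever it asks for a rerun, which is worth verifying on the system in question before relying on it.

```shell
#!/bin/sh
# Rerun a check command until it exits 0, at most MAX times.
# Usage: rerun_until_clean MAX command [args...]
rerun_until_clean() {
    max=$1
    shift
    try=1
    while [ "$try" -le "$max" ]; do
        echo "pass $try: $*"
        if "$@"; then
            echo "clean after $try pass(es)"
            return 0
        fi
        try=$((try + 1))
    done
    echo "still failing after $max passes" >&2
    return 1
}

# Intended use on the server (not executed here):
#   rerun_until_clean 5 fsck_ufs -f -y /dev/da0p1
```

Capping the number of passes keeps the loop from hammering a disk that is actually failing; if the cap is hit, it is time to look at the hardware instead.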

Deleted member 67440 · May 24, 2021

Why do you use UFS+HW RAID with FreeBSD?

Reaperzx · May 24, 2021

Why do I use FreeBSD? Out of habit, I guess, some 15 years already. It usually works fine, if I don't mess with it. Most failures have been from hardware (disks, PSU, mainboard).

Why do I use UFS? I guess no skills for ZFS.

Why do I use hardware RAID? Out of habit too, I guess.

Although I recently set up offsite backup server with 20 older hard disks 1-3TB size. Graid5 configuration. Let's say it was not a trivial thing either, took me several months of my free time to set it up:
https://bsd-hardware.info/?probe=63db0c4563
https://pasteboard.co/K3o5Yrd.png
https://forums.freebsd.org/threads/newfs-wtfs-invalid-argument.77897/
https://forums.freebsd.org/threads/aquantia-5gigabit-nic-driver-status-aqc108.77821/

ralphbsz · May 24, 2021

Reaperzx said:
Why do I use UFS? I guess no skills for ZFS.

While the rest of your post is very reasonable, you should question this assumption. Learning how to set up and use ZFS is not very hard (a half dozen concepts, a half dozen commands, if you want to just replace UFS with something more reliable). And it's a good investment in data durability. Once you know how to do it, you can take all the graid and hardware-specific stuff and forget it, so your life will become easier in the long run.

richardtoohey2 · May 25, 2021

Reaperzx said:
Why do I use FreeBSD? Out of habit, I guess, some 15 years already. It usually works fine, if I don't mess with it. Most failures have been from hardware (disks, PSU, mainboard).

Why do I use UFS? I guess no skills for ZFS.

Why do I use hardware RAID? Out of habit too, I guess.

I'd answer the same to all those - you are not alone.

I've looked at ZFS and there's a lot to learn. There are a lot of extra benefits to ZFS, too, I get that - but lots to learn, especially to get the best performance for different workloads (admittedly stuff I've not learned too much about before using hardware RAID and UFS!)

And there seem to be quite a few off-putting stories on these forums about people having issues with ZFS, ARC & RAM, etc. And yes, there are horror stories about hardware RAID and UFS, too. But better the devil you know.

richardtoohey2 · May 25, 2021

fcorbelli said:
QVOs are NOT a smart choice.

+1 they are targeted at the lower-end market - lots of space but not much else to say about them!

Deleted member 67440 · May 25, 2021

I am primarily a data storage manager, from CP/M to OS/2 to Solaris to now.
I really do not see a single reason NOT to use ZFS.
The PC becomes a 'giant RAID controller' with hundreds of GB and tens of cores.
For me it is the easiest filesystem to manage.
You can make a 'RAID volume' with 3 or 4 commands, that's it.

Nothing else to do (only one exception: limit the ARC size, TWO edits in a config text file).

Done.

No configuration, no optimization needed.

More advanced functions? Ask on the forum!

Short version: FreeBSD with ZFS will become your 'dream'; 90% of everything you will ever want is right here.

I always think that UFS is like a baseball bat, light and reliable.
But ZFS is like an ICBM.
Some gotchas, but definitely more powerful than a baseball bat.

PS: there is no fsck with ZFS (without deduplication).
I repeat.
There is no fsck at all.
No, there is not a 'hidden' fsck or whatever.
This alone is enough for a data manager to throw away everything else.
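
As a rough idea of what those "3 or 4 commands" might be for an 8-disk, double-parity pool, here is a sketch. Everything specific in it is an assumption: the pool name tank, the disk names da0 to da7 (which would require the RAID controller to pass the disks through individually), the helper name zfs_plan, and the 8G ARC cap. The commands are printed, not executed.

```shell
#!/bin/sh
# Print (not run) a minimal raidz2 setup. Pool and disk names are hypothetical.
# Usage: zfs_plan POOL DISK [DISK...]
zfs_plan() {
    pool=$1
    shift
    echo "zpool create $pool raidz2 $*"   # double parity, comparable to RAID6
    echo "zfs create $pool/data"          # one dataset for the data
    echo "zpool status $pool"             # verify the layout and health
}

zfs_plan tank da0 da1 da2 da3 da4 da5 da6 da7

# Assumed ARC cap for /boot/loader.conf (the value is illustrative, not from the thread):
echo 'vfs.zfs.arc_max="8G"'
```

Creating a pool wipes whatever is on the listed disks, so this only makes sense on empty drives or after the data is safely backed up.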

grahamperrin · Jun 11, 2021

Reaperzx said:
fsck: /dev/da0p1: Floating point exception

Reaperzx said:
… I recently set up offsite backup server …

Reaperzx said:
HW probe of ASRock X370 Professional Ga... Desktop Computer #63db0c4563


To clarify: is the probe of the machine where you get the exception with fsck? Or of a server elsewhere?

(Underlying question: with which version of FreeBSD do you get the exception?)
