Crashes under high CPU/IO load on SS4200

Hi folks,

I have an SS4200 for testing purposes. It is running FreeBSD 8-STABLE with 2 GB of RAM and a 1.6 GHz Celeron CPU. Sadly, I encounter a reboot of the machine during an rsync test. Oh, and the filesystem is ZFS and of course GELI-encrypted.

There is no error message to be found in the logs. I attached a serial console as well to see whether something like a kernel trap is printed out beforehand, but nothing. I can reproduce the problem simply by transferring data to it via rsync (I doubt that rsync itself is the problem, but well).
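
In case it does turn out to be a panic, it may be worth enabling kernel crash dumps so the next reboot at least leaves something to analyze; a minimal sketch for /etc/rc.conf (values are illustrative):
Code:
dumpdev="AUTO"        # dump to the configured swap device on panic
dumpdir="/var/crash"  # savecore(8) collects the dump here on the next boot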

As I have been debugging this for quite some time, here are some lines of top -SP:

Code:
CPU:  0.0% user,  0.0% nice,  0.0% system,  0.4% interrupt, 99.6% idle
Mem: 36M Active, 15M Inact, 506M Wired, 92K Cache, 31M Buf, 1430M Free
Swap: 4096M Total, 4096M Free

  PID USERNAME  THR PRI NICE   SIZE    RES STATE    TIME   WCPU COMMAND
   11 root        1 171 ki31     0K     8K RUN     96.7H 55.27% idle
12218 root        1  76    0 84184K 19496K tx->tx   0:13 17.19% rsync
12217 root        1  73    0 50392K 11812K select   1:03 10.99% rsync

It's very interesting that shortly before the system crashes it always idles at almost 100%, even though rsync isn't done yet; then, right before the failure, the system is at 100% usage.

Now, I suspect it to be some kind of hardware problem, maybe heat, maybe something else. My question is whether somebody else is using this device successfully, or knows this issue, maybe in connection with the Celeron CPU. I would also be interested in whether there are standard modules I could try to load to get support for any temperature sensors that might exist. No CPU temperature reading has been found so far (at least not under the ACPI sysctls, where I expected it).
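
On the sensor question, something along these lines might work if the CPU is supported by coretemp(4) (I'm not sure this particular Celeron is):
Code:
kldload coretemp
sysctl dev.cpu.0.temperature
To make it permanent, coretemp_load="YES" can go into /boot/loader.conf.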

OK, maybe some of you can help. Best regards,
 
I don't know much, but I had a machine many years ago that would reboot on high network loads (without dumps, errors in the logs, anything, really), which I eventually traced to a NIC that was flaky.

In any case, with no crash dumps or log messages, one of the first things I suspect is hardware failure of some kind.
 
fronclynne said:
I don't know much, but I had a machine many years ago that would reboot on high network loads (without dumps, errors in the logs, anything, really), which I eventually traced to a NIC that was flaky.

In any case, with no crash dumps or log messages, one of the first things I suspect is hardware failure of some kind.

To be honest, I suspected this the most... but as the behavior was so unpredictable, I threw that idea away. Since you had the same experience, I'll go and check this again.
 
Can you successfully run
[CMD="zpool"]scrub mypool[/CMD]
without it hanging?

I'm having a similar problem with a pool of 4 geli attached 1TiB drives and it always silently hangs during a scrub.
 
edwtjo said:
Can you successfully run
[CMD="zpool"]scrub mypool[/CMD]
without it hanging?

I'm having a similar problem with a pool of 4 geli attached 1TiB drives and it always silently hangs during a scrub.

Interesting, I have four 1 TB HDDs with ZFS as well, though they are not in the same RAID. I will do the scrub test; let's see if it hangs as well.
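
For reference, the test I'll run, roughly ("tank" is a placeholder for the pool name):
Code:
zpool scrub tank
zpool status -v tank    # shows scrub progress and any errors found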
 
I've reduced the complexity of the setup and eliminated GELI, but the problem persists, i.e. the system hangs under I/O load.

Let's compare loader.conf files. I'm running amd64 and my loader.conf has the following set:
Code:
hptrr_load="YES"
zfs_load="YES"
geom_eli_load="YES"
snd_hda_load="YES"
net.fibs=6
vfs.root.mountfrom="zfs:rootzfs"
hw.pci.enable_msi=0
hw.pci.enable_msix=0

vm.kmem_size_max="1024M"
vm.kmem_size="1024M"
vfs.zfs.arc_max="100M"

We might have a common denominator.

This link seems somewhat related. Since I'm running 64-bit, this forum thread is also relevant.
 
First, the scrubbing took a while but it has finished now, with no hangs during the process (my current setup has 2x RAID 1). Here is my old loader.conf:

Code:
geom_eli_load="YES"
zfs_load="YES"
boot_multicons="YES"
boot_serial="YES"
comconsole_speed="9600"
console="comconsole,vidconsole"
vm.kmem_size="512M"
vm.kmem_size_max="512M"
vfs.zfs.prefetch_disable=0
vfs.zfs.arc_max="1024M"

Values I changed:
Code:
vfs.zfs.prefetch_disable=1
vfs.zfs.arc_min=122880000
vfs.zfs.arc_max=983040000
vfs.zfs.vdev.cache.size=8388608

Now I'll run some more tests and see whether it is more stable. As I understood from the links, prefetching can be very memory hungry; still, I thought this should drop a message before rebooting, but well, let's test. One notable difference is that I run the setup on an i386 installation.
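
One thing I want to double-check is that arc_max stays well below kmem_size, since on i386 the kernel address space is limited. A rough sketch of the kind of values I mean (illustrative, not tested on the SS4200):
Code:
vm.kmem_size="512M"
vm.kmem_size_max="512M"
vfs.zfs.arc_max="160M"
vfs.zfs.prefetch_disable=1
vfs.zfs.vdev.cache.size="8M"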
 
OK, I don't want to be premature, but since last night there have been no crashes with the box. :)
I am unsure whether the new ARC/vdev values did the trick, or prefetch_disable=1, or whether it was just luck. Any new results on your side?
 
I've been unable to get a stable system by tweaking tunables. My testing is to issue a
Code:
dd if=/dev/random | ssh edwtjo@localbox dd of=/mnt/raidz1pool/testfile
over gigabit Ethernet until the pool is completely full. I think that using /dev/random is more courteous to the FreeBSD box since it will need to block now and then, thus reducing throughput, but so far this crashes my machine on every single run.

I'm going to switch from ZFS (it's CDDL-encumbered anyway) to geom_raid5 by Arne Wörner.

Well, it seems I'm out of luck since I cannot get the graid5 code working on 8.0-p2, so I'm going to try gvinum instead...
 
Last night the system crashed on my side as well :( so I guess I will give it a shot with an amd64 installation and an external USB-to-Ethernet adapter to check for a flaky network card.
If this doesn't work, I am unsure about a good replacement for ZFS. From what I have understood and seen so far, ZFS is very good at not losing files and data during unexpected reboots, which is kind of important for me ;) in contrast to my experience with UFS. But maybe UFS is more stable when used on top of a RAID layer :stud
 
Well, using geom_raid5 and geom_journal should be pretty fault tolerant as well, or at least good enough :) And it will most definitely be faster.

On the bright side, you do get ACLs back, which are effectively eliminated with ZFS.
 
Kernel Segfault != Kernel Panic

eyebone said:
Still, I thought this should drop a message before rebooting, but well, let's test. One notable difference is that I run the setup on an i386 installation.

I missed this... Well, if the kernel segfaults you will not get any output at all. Your reboot is indicative of a panic, though.
 
This is interesting. I have several things in the crontab at the moment: zfs-snapshot-mgmt, a regular rsync of one pool to another, and a daily scrub of both zpools. Excessive, but I'm testing as much as anything. The box has hung twice lately, the second time after being up for about 4 days (and running from a UPS, so I would think that should lower the likelihood of a voltage variation being responsible). Suspiciously, the last snapshot was right around the time of the zpool scrub.

ZFS commands zfs-snapshot-mgmt executes:
# zfs destroy
# zfs snapshot
# zfs mount
# zpool status

Additional ZFS-related commands I am running in my transfer script (not present in zfs-snapshot-mgmt):
# zfs clone
# zfs set mountpoint
# zfs set readonly

I note that zfs destroy has been known to hang machines; it did on mine. I thought -r was to blame, so I removed it from my scripts. From a cursory glance at the Ruby script zfs-snapshot-mgmt, it looks like the -r option is used there (though I don't know Ruby).

I've rescheduled the zpool scrubs for once every Saturday. If the machine dies before then, I'll blame zfs-snapshot-mgmt. If it dies on Saturday, it looks like scrubbing is the problem.
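
For reference, the crontab entry is along these lines (pool name and time are just examples):
Code:
# scrub the storage pool early every Saturday morning
0 3 * * 6 /sbin/zpool scrub storage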
 
I saw that the scrub restarts if I make a snapshot while the scrub is running.

Do you copy only from ZFS to ZFS?
I had trouble with a hanging system (8.0) while I was copying millions of files between ZFS and UFS.
I switched to a ZFS-only setup and the problem disappeared.
 
User23 said:
I saw that the scrub restarts if I make a snapshot while the scrub is running.

Do you copy only from ZFS to ZFS?
Yes, I only copy from ZFS to ZFS. I'll have to have a closer look at the scrub to see if it resets. You do mean reset, don't you, not stop temporarily? A scrub usually starts, then stops for a minute or so, then resumes again, irrespective of snapshots.
 
Hm, I made some snapshots while a scrub was running. The scrub kept running and was not reset to "0% done". Strange... I am not sure whether I saw that behavior on FreeBSD 7.3 or 8.0.

At the moment I am playing around with zfs send/recv between ZFS filesystems on two different machines.
Maybe zfs send/recv could be an option for you to avoid the high I/O load rsync puts on your system.
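
Roughly like this, for example (filesystem, snapshot and host names are placeholders):
Code:
zfs snapshot storage/home@nightly
zfs send storage/home@nightly | ssh backupbox zfs receive backuppool/home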
 
User23 said:
At the moment I am playing around with zfs send/recv between ZFS filesystems on two different machines.
Maybe zfs send/recv could be an option for you to avoid the high I/O load rsync puts on your system.
Thanks for getting back to me about the zfs snapshot during scrub.

As a matter of fact, I first tried zfs send/receive as an option. Unfortunately I found that zfs send/receive is not really appropriate for my application. I'll let you in on what I'm doing: writing a guide to setting up an ultra-reliable workstation using a ZFS root mirror, a RAIDZ2 storage pool, and regular backups. I'll post the howto here when I'm done.

So, there is zroot, which is the SSD-based root mirror. Then there is storage, which is the HDD-based RAIDZ2 storage pool. On storage there is the home directory (storage/home) and the backup directory for zroot (storage/zrootbackup). Assuming you use small SSDs for zroot, we can keep maybe a week or two worth of snapshots on it for quick rollback. But we can transfer zroot to storage regularly and retain more snapshots there, so that we can restore from arbitrarily far back as well as gaining a bit more reliability (if both root-mirror SSDs die, we've got an interim backup).

We'll use zfs-snapshot-mgmt for the regular snapshotting and snapshot deletion on both zroot and storage. I had planned on having a snapshot named "current" to use with zfs send/receive, but that didn't work if another snapshot had been taken on the destination in the interim. So I was forced to use rsync. However, for the proper backups (to an HDD which is then taken offsite), I will be using zfs send/receive in order to transfer all the snapshots on storage (or rather zfs-replicate, which uses zfs send/receive, once I fix a bug in it). If my computer keeps hanging like it has been, I will be forced to rewrite zfs-snapshot-mgmt so that it doesn't use recursive destroy.
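
For what it's worth, incremental replication usually looks something like this (names are placeholders); the -F on receive forces the destination back to a matching state, which is exactly what gets in the way if independent snapshots are being taken there:
Code:
zfs snapshot zroot@current
zfs send -i zroot@previous zroot@current | zfs receive -F storage/zrootbackup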

I really don't think rsync is the problem; I suspect it has something to do with the zpool scrub or the recursive destroy in zfs-snapshot-mgmt. I have seen zfs destroy hang my machine, which I believe is due to the machine running out of memory (kernel virtual address space). It seems to be more stable to destroy each filesystem's snapshot individually rather than use a recursive destroy, perhaps because less memory is used in the former case.
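
In other words, something like this instead of a single zfs destroy -r (filesystem name is a placeholder):
Code:
# destroy each snapshot under the filesystem one at a time
zfs list -H -t snapshot -o name -r storage/home | \
    while read snap; do zfs destroy "$snap"; done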

I should see some verification soon, if my machine hangs again. I've also used some /boot/loader.conf options that phoenix mentioned in one of his posts. That may help things too.
 
We have been using a ZFS server in production for many weeks now, and it has also run extensive I/O tests.

The only ZFS-related thing that caused a hang was using gzip compression, so if you have that enabled, try disabling it.
 
chrcol said:
We have been using a ZFS server in production for many weeks now, and it has also run extensive I/O tests.

The only ZFS-related thing that caused a hang was using gzip compression, so if you have that enabled, try disabling it.

Do you recommend avoiding just gzip, or disabling compression altogether?
We run our pools with lzjb compression on 8-STABLE, ZFS version 14, and haven't had any problems yet.
 
lzjb works fine for us, with almost non-existent CPU slowdown as well, so it seems a far more efficient compression algorithm. For us, lzjb is better for stability and performance; the only possible advantage I see in gzip is saving space, and given the downsides it's better to just buy bigger drives.
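
Switching an existing dataset over is a one-liner (dataset name is a placeholder); note that only newly written blocks get the new compression, existing data keeps whatever it was written with:
Code:
zfs set compression=lzjb storage/home
zfs get compression storage/home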
 
chrcol said:
lzjb works fine for us, with almost non-existent CPU slowdown as well, so it seems a far more efficient compression algorithm. For us, lzjb is better for stability and performance; the only possible advantage I see in gzip is saving space, and given the downsides it's better to just buy bigger drives.
I never thought to try that, thanks for the tip. I have been using gzip compression rather extensively.

So far, increasing kmem_size_max and disabling prefetch, along with reducing the scrub frequency, seem to have made my machine more stable: 4 days' uptime and counting.

I have to wonder whether hangs as a result of using gzip, doing big recursive zfs destroys, scrubs, etc. are all symptomatic of not having enough memory, i.e. if you have enough RAM and allocate it appropriately, you won't see these sorts of problems and can use ZFS the way it was supposed to be used.
 
OK, I get system reboots as well for large file transfers from one ZFS pool to the other. I will change the filesystem as well. After several weeks of debugging and observing the error conditions on this thing, something is fishy with ZFS :(
 
edwtjo said:
I missed this... Well, if the kernel segfaults you will not get any output at all. Your reboot is indicative of a panic, though.

Well, good point. Let's see if I have more luck with geom_raid :)
 
Not sure whether to make this a separate thread, but I was wondering if anyone else is familiar with diagnosing a "hang" or a "freeze", i.e. you can't move the mouse, the system just sits there and needs to be hard rebooted. /var/log/messages does not show anything. I'm thinking of setting up a cron job to write the time to a file every minute, so as to be able to pinpoint the freeze to the nearest minute; maybe it's something ZFS-related in the crontab that is causing a hang every 4-6 days or so. Better solutions?

Edit: well, so far I've done the following:
# crontab -e
Code:
# For debugging of crashes/hangs/freezes
* * * * * date >> /var/log/timelog
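
After the next freeze, the last entry should narrow down when it stopped:
Code:
tail -n 1 /var/log/timelog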
 
The timelog is coming in handy already. It hung this morning at 5:45 am, which is unusual in that it normally crashes after 4-6 days and not ~12 hours after a reboot. This is also unusual because I've set zfs-snapshot-mgmt to run every 10 minutes and my rsync-based transfer process to run every hour at 15 minutes past. So I can't pin this on ZFS as yet - nothing unusual happens at the 45-minute mark.

I've just switched all compression, where it was enabled, to lzjb (from gzip), in case that is the problem. I guess we'll see what happens now.
 