Yup, I'm having the same (or a very similar) issue randomly as well. Running 8.2-RELEASE on real hardware, not VMware or any other virtualization.
I can reproduce it a few times, then it won't reproduce for another few tries. It isn't specific to the master or the slave either. Very odd.
Basically, what appears to happen is that whenever either the Master or the Slave changes its HAST role to "init", the worker processes on the init-role system exit, and the worker processes on the other system (primary or secondary) attempt to restart themselves, and sometimes fail.
For instance, on my Master I start with this:
Code:
nas1# ps -p `pgrep hastd`
PID TT STAT TIME COMMAND
40343 ?? Ss 0:00.02 /sbin/hastd
40563 ?? I 0:00.01 hastd: ada0 (primary) (hastd)
40564 ?? I 0:00.01 hastd: ada1 (primary) (hastd)
40565 ?? I 0:00.01 hastd: ada2 (primary) (hastd)
40566 ?? I 0:00.01 hastd: ada3 (primary) (hastd)
40567 ?? I 0:00.01 hastd: ada4 (primary) (hastd)
40568 ?? I 0:00.01 hastd: ada5 (primary) (hastd)
40569 ?? I 0:00.01 hastd: ada6 (primary) (hastd)
40570 ?? I 0:00.01 hastd: ada7 (primary) (hastd)
40571 ?? I 0:00.01 hastd: ada8 (primary) (hastd)
40572 ?? I 0:00.01 hastd: ada9 (primary) (hastd)
40573 ?? I 0:00.01 hastd: ada10 (primary) (hastd)
40574 ?? I 0:00.01 hastd: ada11 (primary) (hastd)
40575 ?? I 0:00.01 hastd: ada12 (primary) (hastd)
I then issue [CMD="nas2#"] hastctl role init[/CMD] on the slave. Checking the Master again, I see that the worker PIDs have changed for most of them:
Code:
nas1# ps -p `pgrep hastd`
PID TT STAT TIME COMMAND
40343 ?? Ss 0:00.04 /sbin/hastd
40572 ?? I 0:00.01 hastd: ada9 (primary) (hastd)
41432 ?? I 0:00.00 hastd: ada3 (primary) (hastd)
41435 ?? I 0:00.00 hastd: ada12 (primary) (hastd)
41444 ?? I 0:00.00 hastd: ada11 (primary) (hastd)
41447 ?? I 0:00.00 hastd: ada10 (primary) (hastd)
41456 ?? I 0:00.00 hastd: ada8 (primary) (hastd)
41459 ?? I 0:00.00 hastd: ada7 (primary) (hastd)
41468 ?? I 0:00.00 hastd: ada6 (primary) (hastd)
41471 ?? I 0:00.00 hastd: ada5 (primary) (hastd)
41480 ?? I 0:00.00 hastd: ada4 (primary) (hastd)
41483 ?? I 0:00.00 hastd: ada2 (primary) (hastd)
41492 ?? S 0:00.00 hastd: ada1 (primary) (hastd)
41495 ?? S 0:00.00 hastd: ada0 (primary) (hastd)
This is fine: when I start the Slave back up as a HAST secondary, everything comes back to life. If I'm unlucky, though, I sometimes get this instead:
Code:
nas1# ps -p `pgrep hastd`
PID TT STAT TIME COMMAND
6967 ?? Is 0:00.18 /sbin/hastd
9436 ?? I 0:00.00 hastd: ada12 (primary) (hastd)
9437 ?? Z 0:00.00 <defunct>
9447 ?? Z 0:00.00 <defunct>
9448 ?? Z 0:00.00 <defunct>
9449 ?? Z 0:00.00 <defunct>
9450 ?? Z 0:00.00 <defunct>
9460 ?? Z 0:00.01 <defunct>
9461 ?? I 0:00.00 hastd: ada11 (primary) (hastd)
9471 ?? I 0:00.00 hastd: ada10 (primary) (hastd)
9472 ?? Z 0:00.00 <defunct>
9483 ?? Z 0:00.00 <defunct>
9484 ?? Z 0:00.00 <defunct>
(this is from a different session, so the PIDs are not relevant here)
Now, when this happens and the worker processes go zombie, the only way to fix it is to run [CMD=""]kill -9 `pgrep hastd`[/CMD] and then restart the hastd service. I tried waiting to see if they'd clean themselves up, but they just hang around. In this state, hastd is unresponsive to service restarts and hastctl commands just hang.
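In case it helps anyone scripting around this, here's a rough watchdog sketch of the kill-and-restart workaround. The `count_zombies` helper, the ps field choice, and the rc.d restart method are my own assumptions, not anything official:

```shell
#!/bin/sh
# Hypothetical watchdog for the wedged-hastd state described above.
# count_zombies reads "pid state command" lines on stdin and prints
# how many processes are in state Z (zombie).
count_zombies() {
        awk '$2 ~ /^Z/ { n++ } END { print n + 0 }'
}

# Only act if hastd is running at all; then, if any zombies are
# present (on my boxes the only zombies were hastd workers), nuke
# the daemon and start it fresh.
if pgrep hastd >/dev/null 2>&1 \
    && [ "$(ps -axo pid,state,comm | count_zombies)" -gt 0 ]; then
        kill -9 `pgrep hastd`
        /etc/rc.d/hastd onestart
fi
```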
If this happens on the slave, it's no big deal to kill them all and start it up again. But when it happens on the master, I lose my storage for the time it takes to restart hastd and its worker processes. My zpool will be very unhappy in that case!
All this testing is purely manual, but I noticed the problem most often when using really simple failover scripts with CARP/devd.
Will try to post more if I figure anything new out.
Update: A couple of things I'm noticing which DO NOT seem to ever result in hast worker procs turning into zombies and thus breaking replication. First, if I do a [CMD="host#"]kill -9 `pgrep hastd`[/CMD] on the slave, the hastd worker procs on the master restart themselves successfully every time; if I do it on the master, the worker procs on the slave exit gracefully, no zombies. This effectively simulates a real failure of some kind. Second, when I want to gracefully switch roles, if I first change the master's HAST resources to secondary ([CMD="nas1#"]hastctl role secondary all[/CMD]) so that I briefly have two secondaries, and only then change the slave's resources to primary ([CMD="nas2#"]hastctl role primary all[/CMD]), the role transition is smooth and nothing breaks. I can't reproduce the above errors. I'm still doing all this manually, so I haven't hammered at it. What I'm thinking, then, is that there is something up with the "init" role (I know, why was he doing that!), and by not using it the issues have apparently abated.
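That two-step ordering can be wrapped in a tiny driver if you manage both nodes from a third box. This is only a sketch; the hostnames and passwordless ssh are assumptions:

```shell
#!/bin/sh
# Sketch of the two-step switchover that avoids the "init" role.
# Assumes passwordless ssh to both nodes; run from an admin host.
# $1 = node to promote to primary, $2 = node to demote to secondary.
switch_master_to() {
        # Demote the current master FIRST, so both nodes are briefly
        # secondaries, then promote the new master.
        ssh "$2" 'hastctl role secondary all' || return 1
        ssh "$1" 'hastctl role primary all'   || return 1
}

# e.g.: switch_master_to nas2 nas1
```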
Update 2: So, that didn't solve all of it. I set up a simple script for CARP/devd and the Slave crashed hard with a page fault. I've been getting this regularly. Here's what I get on the screen:
Code:
processor eflags = interrupt enabled, resume, IOPL
current process = 39498 (hastd)
trap number = 12
panic: page fault
cpuid = 3
KDB: stack backtrace:
#0 0xffffffff805f4e0e at kdb_backtrace+0x5e
#1 0xffffffff805c2d07 at panic+0x187
#2 0xffffffff808ac600 at trap_fatal+0x290
#3 0xffffffff808ac9df at trap_pfault+0x28f
#4 0xffffffff808acebf at trap+0x3df
#5 0xffffffff80894fb4 at calltrap+0x8
#6 0xffffffff8054cebd at devfs_ioctl_f+0x7b
#7 0xffffffff806043c2 at kern_ioctl+0x102
#8 0xffffffff806045fd at ioctl+0xfd
#9 0xffffffff80600dd5 at syscallenter+0x1e5
#10 0xffffffff808aca5b at syscall+0x4b
#11 0xffffffff80895292 at Xfast_syscall+0xe2
Uptime: 20h22m35s
Cannot dump. Device not defined or unavailable
Automatic reboot in 15 seconds - press a key on the console to abort
panic: bufwrite: buffer is not busy???
cpuid = 3
Unfortunately, the system does not restart. It's completely hosed until a hard reset is performed.
Here's my DevD script:
Code:
nas1# cat /etc/devd/carp.conf
notify 10 {
        match "system" "IFNET";
        match "subsystem" "carp0";
        match "type" "LINK_UP";
        action "/usr/local/bin/role-switch.sh master";
};
notify 10 {
        match "system" "IFNET";
        match "subsystem" "carp0";
        match "type" "LINK_DOWN";
        action "/usr/local/bin/role-switch.sh slave";
};
Here's my role switching script:
Code:
#!/usr/local/bin/bash
hast_role_change()
{
        # log what we're doing
        logger -p local0.debug -t hast "Attempting role change to $1."
        # allow worker procs on the old primary to exit gracefully before changing roles
        if [ "$1" = "primary" ]; then
                sleep 30
        fi
        # change role and check the exit status of the attempt
        if ! hastctl role "$1" all; then
                logger -p local0.debug -t hast "Unable to change HAST role to $1. Aborting cluster role change."
                exit 1
        else
                logger -p local0.debug -t hast "HAST role change to $1 completed successfully."
        fi
}
# log cluster role change request
logger -p local0.debug -t cluster "Role change request: $1"
case "$1" in
master)
        # change role from slave to master
        logger -p local0.debug -t cluster "Attempting role change to $1."
        # Change role to primary for all HAST resources
        hast_role_change primary
        ;;
slave)
        # change role from master to slave
        logger -p local0.debug -t cluster "Attempting role change to $1."
        # Change role to secondary for all HAST resources
        hast_role_change secondary
        ;;
esac
# log cluster role change success
logger -p local0.debug -t cluster "Role change to $1 completed successfully."
I've seen other scripts that check for worker procs before switching, and I intend to add those. But for this testing in my simple setup, a manual transition takes no time at all for the worker processes to exit, so sleeping for 30 seconds (meaning there are two secondaries for a short time) should be adequate before a secondary is promoted to primary.
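A sketch of what such a check might look like in place of the fixed sleep. The 30-second cap and the worker process-title pattern are assumptions based on how the workers appear in my ps output above:

```shell
#!/bin/sh
# Sketch: wait (up to a cap) for the old primary's workers to exit
# before promoting, instead of sleeping a fixed 30 seconds.
wait_for_primary_drain() {
        deadline=$(( $(date +%s) + 30 ))
        # hastd retitles its workers "hastd: <res> (primary)"; wait
        # until none remain in the primary role.
        while pgrep -f 'hastd: .* \(primary\)' >/dev/null 2>&1; do
                if [ "$(date +%s)" -ge "$deadline" ]; then
                        return 1        # workers still around; bail out
                fi
                sleep 1
        done
        return 0
}
```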
Now, these page fault errors are pretty consistent. I can sometimes get one or two graceful role transitions, but by the 3rd or 4th (and again, I'm waiting for data transfers to complete) the master or slave will inevitably crash.
One more thing: I'm testing failover just by bringing down the CARP interface with [CMD="host#"]ifconfig carp0 down[/CMD].
Another Update:
So, it seems this issue is related to CARP/devd. I tested the role-switch.sh script above manually by having a terminal open on both systems and executing the script simultaneously, supplying "master" on one and "slave" on the other. I could change roles over and over without fail, with no errors. Then I tried again using CARP and devd triggers, and the Master node crashed with a page fault on the first role transition.
In another test, I used just a simple logging script that logs "changing role to ..." but doesn't actually touch the HAST roles. In those tests there were no issues: I can bring a CARP interface down, the peer will come up, and both will log the events.
Sorry if this post is a bit long. I'm not sure if it's better to post smaller ones or to just keep updating this one until I get replies.
Another Update: Yeah, even the script randomly causes page faults. I'm at my wits' end on this. It seems so simple... sigh. I really have no clue what's going on.
I would greatly appreciate any suggestions.
Thanks!