Solved Kernel panic on armv7 with qemu

oOiOo · May 5, 2023

I'm trying to run a generic armv7 image (13.2) on an amd64 host (13.2), and I get a panic involving the network interface. To start the VM, I use the following command:

 qemu-system-arm -M virt -m 2048m -nographic -nic tap -bios /usr/local/share/u-boot/u-boot-qemu-arm/u-boot.bin -hda FreeBSD-13.2-RELEASE-arm-armv7-GENERICSD.img

The tap interface is connected to the host's bridge and DHCP negotiation succeeds, but as soon as I use the network (pkg install, for example), I get a panic:

Code:

root@generic:~ # pkg install git
The package management tool is not yet installed on your system.
Do you want to fetch and install it now? [y/N]: y
Bootstrapping pkg from pkg+http://pkg.FreeBSD.org/FreeBSD:13:armv7/quarterly, please wait...
Fatal kernel mode data abort: 'Alignment Fault' on read
trapframe: 0xd5ecea60
FSR=00000001, FAR=e11eb01a, spsr=20000013
r0 =00000000, r1 =00000001, r2 =00000001, r3 =d5eceb4c
r4 =00000014, r5 =d91eb800, r6 =e11eb02e, r7 =000001f7
r8 =00000000, r9 =000001f7, r10=e11eb01a, r11=d5eceb90
r12=35006f01, ssp=d5eceaf0, slr=c04a9728, pc =c04a9750

panic: Fatal abort
cpuid = 0
time = 1680846870
KDB: stack backtrace:
#0 0xc035786c at kdb_backtrace+0x48
#1 0xc02fdd20 at vpanic+0x140
#2 0xc02fdbe0 at vpanic+0
#3 0xc06304ac at abort_align+0
#4 0xc063052c at abort_align+0x80
#5 0xc063017c at abort_handler+0x480
#6 0xc060f480 at exception_exit+0
#7 0xc04a9750 at udp_input+0x288
#8 0xc0473f54 at ip_input+0x1e0
#9 0xc04447c0 at netisr_dispatch_src+0xf8
#10 0xc043bf2c at ether_demux+0x1a4
#11 0xc043d5e4 at ether_nh_input+0x480
#12 0xc04447c0 at netisr_dispatch_src+0xf8
#13 0xc043c404 at ether_input+0x50
#14 0xc01c0838 at vtnet_rx_vq_process+0x880
#15 0xc01b70d0 at vtpci_intx_intr+0xac
#16 0xc02b87f0 at ithread_loop+0x2ec
#17 0xc02b465c at fork_exit+0xc0
Uptime: 2m25s
 
 
U-Boot 2023.01 (Apr 03 2023 - 20:47:58 +0000)

DRAM:  2 GiB
Core:  47 devices, 13 uclasses, devicetree: board
Flash: 64 MiB

Loading Environment from Flash... *** Warning - bad CRC, using default environment

I'm not familiar with Qemu, maybe I'm missing something.

Regards

_martin · May 6, 2023

It's been some time since I worked with ARM. I can confirm that I can reproduce this in my VM with the same image. I tried the virtio drivers too (-device virtio-net-pci,netdev=network0 -netdev tap,id=network0,br=br0), but it fails in the same function: udp_input+0x288

An alignment fault suggests a problem with the natural alignment of some data.
I connected gdb to the booting VM and saw this:

Code:

Breakpoint 2, udp_input (mp=<optimized out>, offp=<optimized out>, proto=17) at /usr/src/sys/netinet/udp_usrreq.c:504
504    /usr/src/sys/netinet/udp_usrreq.c: No such file or directory.
=> 0xc04a9750 <udp_input+648>:    03 00 9a e8    ldm    r10, {r0, r1}
   0xc04a9754 <udp_input+652>:    00 20 a0 e3    mov    r2, #0
   0xc04a9758 <udp_input+656>:    08 30 da e5    ldrb    r3, [r10, #8]
(gdb) i r $r10
r10            0xd8f1901a          -655257574
(gdb)

(gdb) i r cpsr
cpsr           0x20000013          536870931
(gdb)

This indicates that Thumb mode is disabled (bit 5 of cpsr is clear). The memory access is at 0xd8f1901a, which is not a multiple of 4 and hence explains the exception.
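The two checks above can be reproduced with a few lines of C. This is only an illustrative sketch: the helper names `is_aligned` and `in_thumb_state` are made up for this example, and the input values are the ones copied from the gdb session and the original panic.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 5 of the ARM CPSR is the T (Thumb state) flag. */
#define CPSR_T_BIT (UINT32_C(1) << 5)

/* Hypothetical helper: true when addr is a multiple of align.
 * align must be a power of two. */
static bool is_aligned(uint32_t addr, uint32_t align)
{
    return (addr & (align - 1)) == 0;
}

/* Hypothetical helper: true when the saved status register says the
 * CPU was executing in Thumb state. */
static bool in_thumb_state(uint32_t cpsr)
{
    return (cpsr & CPSR_T_BIT) != 0;
}
```

With the values from this thread, `is_aligned(0xd8f1901a, 4)` is false because the low two bits are binary 10 (the address sits 2 bytes past a word boundary), and `in_thumb_state(0x20000013)` is false. The FAR from the original panic, 0xe11eb01a, fails the alignment check in the same way.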

Maybe the data was obtained while in Thumb mode? Just a wild guess on my part.
I think opening a PR would be the way to go here.

_martin · May 6, 2023

I opened PR 271288 for this. I was able to reproduce this without the tap interface. The crash happened elsewhere, but it was still on unaligned data structures.

_martin · May 7, 2023

I did some more testing. It seems the address of struct ip* in the mbuf used for the given connection is not aligned. I don't understand the internal process of how the buffer gets allocated and used for a given connection.
But I wondered whether this is a driver-related issue or not. It seems to be virtio-net related. When I start my VM like this:

qemu-system-arm -M virt -m 2G -hda FreeBSD-13.2-RELEASE-arm-armv7-GENERICSD.img -s -bios u-boot.bin -device rtl8139,netdev=network0 -netdev user,id=network0

I was not able to trigger the crash (yet). So it's not a solution but at least you can use the VM with network this way.

Phishfry · May 7, 2023

I was going to suggest trying e1000 but I see you have realtek.

_martin · May 7, 2023

Phishfry said:
I was going to suggest trying e1000 but I see you have realtek.

It was actually my first pick too, but there's no driver for it (if_em) on the image provided. I'm not sure if that's an arch dependency issue; I didn't dive deep there. But as I saw the driver for the (well, I don't want to say good) old rtl8139, I went with that during my PoC tests.

I'm stressing the VM and it's still running. So maybe it's a valid workaround then.

covacat · May 8, 2023

i have such a VM created 2 years ago
it used to work ok with qemu 4.2 or 5.0 (can't remember which i used)
now it bombs with 6 and 7.2
LE
the image runs 13.0-R

oOiOo · May 8, 2023

_martin said:
I did some more testing. It seems the address of struct ip* in the mbuf used for the given connection is not aligned. I don't understand the internal process of how the buffer gets allocated and used for a given connection.
But I wondered whether this is a driver-related issue or not. It seems to be virtio-net related. When I start my VM like this:

qemu-system-arm -M virt -m 2G -hda FreeBSD-13.2-RELEASE-arm-armv7-GENERICSD.img -s -bios u-boot.bin -device rtl8139,netdev=network0 -netdev user,id=network0

I was not able to trigger the crash (yet). So it's not a solution but at least you can use the VM with network this way.

Thanks a lot for your help. It works with slow performance but it works

_martin · May 8, 2023

covacat said:
it used to work ok with qemu 4.2 or 5.0 (can't remember which i used)

I didn't have much time (prolonged weekend here), but I tested a handful of combinations: different firmware (edk2 and u-boot), 13.1 and 13.2 (13.1 didn't even boot), and a few qemu instances. I don't have anything as old as 4.x; it might be worth a try.

But... an open brainstorming question: why would it be qemu-dependent? This is a problem on the guest side. From a quick glance at virtio-net I was not able to figure out which part I should focus on.

I wanted to try this on real HW, on my rpi3. I do have a J-Link, so debugging would be easy. The problem is that I was not able to make it boot, even with the rpi3 u-boot (it was stuck at the rainbow screen).

oOiOo said:
Thanks a lot for your help. It works with slow performance but it works

Yw. Yeah, speeds are not great with that.

While a nudge from a developer would be nice here, I'll try to look around more.

covacat · May 8, 2023

i created the vm in order to build stuff for a rpi0
wasnt able to boot any armv6 just a v7 so in the end i patched the kernel to report v6 to the ports / package system
i had no problem fetching ports/packaging with virtio

Related thread: "qemu - run / boot freebsd arm 32bit image in qemu" (forums.freebsd.org)

also i looked at the diffs between virtio-net drivers between 13.0 and 13.2 and saw no material diff

_martin · May 8, 2023

covacat said:
it used to work ok with qemu 4.2 or 5.0 (can't remember which i used)

I found several topics and bug reports about qemu failing to behave according to the ARM specification when it comes to aligned memory access. This would explain why it worked before and may not be working now.

When it comes to unaligned data access, this is what the ARM docs say about it: ldm always produces an alignment fault on an unaligned address.
I found one example with a possible solution to it. That being said, the "udp" part of the code is not a problem when the RL driver is used, as the m_data address is properly aligned.
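For reference, one portable way to read a multi-byte value from an address that may be misaligned is to go through bytes instead of dereferencing a wider pointer. The sketch below is generic C, not the code from the linked example or from the kernel; the function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/*
 * Read a 32-bit value (host byte order) from an address that may not be
 * 4-byte aligned. memcpy() from a byte pointer carries no alignment
 * promise, so the compiler cannot rely on word alignment of p.
 */
static uint32_t load_u32_unaligned(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Same idea with an explicit little-endian byte order, one byte at a time. */
static uint32_t load_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}
```

Byte-wise access is slower than a single word load, which is presumably why network code prefers to keep headers aligned in the first place rather than read them this way.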

_martin · May 10, 2023

I wrote a patch that makes it possible to use the vtnet driver under armv7 again. It's attached to the PR.
I've tested it on two servers, and it worked without a problem. If you feel like testing, I'd appreciate that.

oOiOo found another PR 257987 with very similar backtrace. As that PR lacks disassembly of the frame where it failed it's hard to say.

As for the "fix" itself, I'd like to hear from the "real" OS devs on this subject. While other drivers such as fxp and re do use the very same approach to fix the alignment issue, I consider it a hack.
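The patch itself is not shown in this thread, so the following is only a sketch of the arithmetic that usually sits behind this kind of fix, not a description of what the patch does. An Ethernet header is 14 bytes, so if a frame starts on a 4-byte boundary the IP header behind it lands 2 bytes off; starting the frame 2 bytes into the buffer puts the IP header back on a word boundary. The constants are defined locally here with the conventional values, and `ip_header_misalignment` is a hypothetical helper. The extra header that virtio-net prepends is ignored.

```c
#include <stddef.h>

#define ETHER_HDR_LEN 14  /* 6-byte dst MAC + 6-byte src MAC + 2-byte EtherType */
#define ETHER_ALIGN    2  /* padding that makes 14 + 2 a multiple of 4 */

/*
 * Offset of the IP header modulo 4, for a frame whose Ethernet header
 * starts `start` bytes into a 4-byte-aligned receive buffer.
 * A result of 0 means the IP header is word aligned.
 */
static size_t ip_header_misalignment(size_t start)
{
    return (start + ETHER_HDR_LEN) % 4;
}
```

Here `ip_header_misalignment(0)` is 2, matching the "2 bytes past a word boundary" addresses seen in the panic, while `ip_header_misalignment(ETHER_ALIGN)` is 0.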

covacat · May 10, 2023

bcopy is expanded by a #define to __builtin_memmove and for whatever reason it makes assumptions about alignment

_martin · May 10, 2023

I'd assume low-level functions don't check these, as performance would suffer. It should be caught higher up in the code.

I wanted to simulate these conditions in my own C code but failed to do so. Or maybe I didn't try hard enough.

On the other hand, triggering such an exception in assembler is trivial:

Code:

.section .text
    .globl    _start, main

_start:
main:
    ldr r1, =arr
    add r1, #2
    ldm r1, {r2,r3}

    mov r0, #1
    mov r1, #42
    svc #0

.section .data
    arr:    .word    0xcafe, 0xbabe

I thought it was up to the compiler to catch these and generate code around them. But why that is not done here, I don't know.

I'd definitely like to hear from an ARM dev who could share some wisdom on how such code should be written in the first place (i.e. always code as if alignment matters).

covacat · May 10, 2023

C:

char dest[19];
static int lkm_event_handler(struct module *module, int event_type, void *arg) {

  int retval = 0;                   // function returns an integer error code, default 0 for OK
  char src[19];
  switch (event_type) {             // event_type is an enum; let's switch on it
    case MOD_LOAD:                  // if we're loading

      bcopy(src,dest,9);
      uprintf("LKM Loaded\n");      // spit out a loading message
      break;

    case MOD_UNLOAD:                // if we're unloading
      bcopy(src + 1,dest,9);
      uprintf("LKM Unloaded\n");    // spit out an unloading message
      break;

    default:                        // if we're doing anything else
      retval = EOPNOTSUPP;          // return a 'not supported' error
      break;
  }

  return(retval);                   // return the appropriate value

}

cpp output

C:

char dest[19];
static int lkm_event_handler(struct module *module, int event_type, void *arg) {

  int retval = 0;
  char src[19];
  switch (event_type) {
    case MOD_LOAD:

      __builtin_memmove((dest), (src), (9));
      uprintf("LKM Loaded\n");
      break;

    case MOD_UNLOAD:
      __builtin_memmove((dest), (src + 1), (9));
      uprintf("LKM Unloaded\n");
      break;

    default:
      retval = 45;
      break;
  }

  return(retval);

}

objdump -D -S -j .text freebsd_lkm.o (unaligned)

Code:

  case MOD_LOAD:                  // if we're loading
      bcopy(src+1,dest,9);
  24:    e2801001     add    r1, r0, #1, 0
  28:    e59f0024     ldr    r0, [pc, #36]    ; 54 <lkm_event_handler+0x54>
  2c:    e3a02009     mov    r2, #9, 0
  30:    ebfffffe     bl    0 <memmove>
  34:    e59f001c     ldr    r0, [pc, #28]    ; 58 <lkm_event_handler+0x58>
  38:    ea000000     b    40 <lkm_event_handler+0x40>

objdump -D -S -j .text freebsd_lkm.o (aligned)

Code:

    case MOD_LOAD:                  // if we're loading
      bcopy(src,dest,9);
  20:    e59f002c     ldr    r0, [pc, #44]    ; 54 <lkm_event_handler+0x54>
  24:    e5dd1008     ldrb    r1, [sp, #8]
  28:    e89d000c     ldm    sp, {r2, r3}
  2c:    e1c020f0     strd    r2, [r0]
  30:    e5c01008     strb    r1, [r0, #8]

_martin · May 10, 2023

covacat said:
bcopy is expanded by a #define to __builtin_memmove and for whatever reason it makes assumptions about alignment

Right. As you've shown above, it does work as expected. And that's why I was not able to reproduce this in my C program.

Now the code in udp_input() is clear: the compiler assumes m_data is always aligned.
I've lots of other stuff to do today, but I'd like to check this further as I can't reproduce it.
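To illustrate where such an assumption can come from, here is a small sketch. It uses a made-up 8-byte header, not struct ip or anything from the kernel, and the `packed` attribute is a GCC/Clang extension. The point is only that the alignment is part of the type: once data is accessed through a `struct hdr *`, the compiler is allowed to treat the address as suitably aligned and pick multi-word instructions for the copy.

```c
#include <stdint.h>

/* Hypothetical 8-byte header, for illustration only. */
struct hdr {
    uint32_t a;
    uint32_t b;
};

/* Same fields, declared with no alignment guarantee
 * (GCC/Clang extension). */
struct hdr_packed {
    uint32_t a;
    uint32_t b;
} __attribute__((packed));

/*
 * Copy through typed pointers. The compiler may assume both pointers
 * satisfy _Alignof(struct hdr); passing a misaligned src is undefined
 * behaviour, whatever the hardware would do with it.
 */
static void copy_assuming_aligned(struct hdr *dst, const struct hdr *src)
{
    *dst = *src;
}
```

`_Alignof(struct hdr)` equals `_Alignof(uint32_t)`, while `_Alignof(struct hdr_packed)` is 1 with GCC and Clang, so a copy through the packed type has to be generated without relying on word alignment.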

covacat · May 10, 2023

it's most likely a case of what's in https://pzemtsov.github.io/2016/11/06/bug-story-alignment-on-x86.html

oOiOo · May 11, 2023

_martin,

I just recompiled a 13.2-RELEASE kernel and it works nicely for me. Network speed seems better (65KiB/s vs 27KiB/s) with ftp.

Thanks again for your patch.

_martin · May 11, 2023

Ok, thanks for confirmation. Just please note that such slow speeds are not because of the driver (or its modification). In my tests yesterday I was able to achieve 25MB/s speeds for this VM using virtio drivers.

Btw. I put a KASSERT() there just to make sure we are not subtracting more than we can, even though it's unlikely the size will be that small.

_martin · May 11, 2023

I don't know what I was thinking with that KASSERT() I did there. I updated the patch and uploaded it to the PR.

oOiOo · May 11, 2023

_martin said:
Ok, thanks for confirmation. Just please note that such slow speeds are not because of the driver (or its modification). In my tests yesterday I was able to achieve 25MB/s speeds for this VM using virtio drivers.

I had tested qemu-system-arm inside an amd64 VM (KVM/virtio). I just tested on real hardware, and it's much better (~20MB/s; it's an old machine). It is now usable.
Thanks again.