UFS Workload generators - huge performance variance

While using two different I/O generators against the same drive, I'm seeing a huge performance difference, and I'm hoping someone may know why. My guess is that the I/O path through the stack is vastly different between the two tools.

Drive: PCIe HHHL, 3.2TB NVMe
Link: x4 PCIe 3.0
Driver: FreeBSD 10.3 nvme inbox
Tools used: FIO 2.8 from ports collection, nvmecontrol perftest

When running FIO, a common tool for SNIA-style performance benchmarking, I'm getting approx. 40% of data-sheet performance on 128K sequential writes (approx. 745 MB/s).

Command: fio --name=test --filename=/dev/nvd0 --rw=write --bs=131072 --time_based --runtime=15000 --write_bw_log=run1.log --direct=1 --thread --numjobs=1 --iodepth=1 --gtod_reduce=0 --refill_buffers --norandommap=1 --randrepeat=0
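For comparison, here is a job-file sketch of the same workload with the queue depth raised, assuming fio's posixaio async engine is available on the FreeBSD build (the engine choice and iodepth value are illustrative, not taken from the original test):

```ini
; Hypothetical fio job: same 128K sequential write, but using an
; async engine with a deeper queue so multiple I/Os stay in flight.
[seqwrite-qd32]
filename=/dev/nvd0
rw=write
bs=128k
ioengine=posixaio   ; async engine on FreeBSD (the default engine is sync)
iodepth=32          ; keep 32 I/Os outstanding instead of 1
direct=1
time_based
runtime=60
```

With the sync engine, iodepth above 1 has no effect, so the only way to add parallelism there is more numjobs/threads.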

When I use nvmecontrol(8) perftest, I get FAR better performance, in some instances 110% of the specification.

Command: nvmecontrol perftest -n 1 -o write -s 131072 -t 15000 nvme1ns1

Any thoughts on why there is such a large delta between the two tools? One thing I noticed: when running FIO I had to use the block device /dev/nvd0, while perftest writes directly to /dev/nvme1ns1.

As a side note: when the same FIO test is run on CentOS 6.5, the drive meets all of its performance specifications.
It looks like there may be an I/O-engine component in play here, i.e. sync vs. libaio in the context of threads and queue depths. The sync engine keeps only one I/O in flight per thread, so it needs much higher thread counts to reach the same effective queue depth.
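Back-of-the-envelope arithmetic supports the queue-depth theory: at iodepth=1, throughput is bounded by block size divided by per-I/O latency (Little's law). A minimal sketch, where the ~175 µs completion latency is an assumed figure for illustration, not something measured in the tests above:

```python
def seq_throughput_mb_s(block_size_bytes, queue_depth, avg_latency_s):
    """Little's law estimate: bytes kept in flight / time per I/O."""
    return block_size_bytes * queue_depth / avg_latency_s / 1e6

# Assuming ~175 us per 128 KiB write completion (illustrative value):
qd1 = seq_throughput_mb_s(128 * 1024, 1, 175e-6)
qd4 = seq_throughput_mb_s(128 * 1024, 4, 175e-6)
print(f"QD=1: {qd1:.0f} MB/s, QD=4: {qd4:.0f} MB/s")
# QD=1 lands near the ~745 MB/s seen with fio's sync engine;
# QD=4 is already in data-sheet territory for this drive.
```

This is why the sync engine looks slow at one thread/QD=1 while an async engine (or many sync threads) can saturate the drive.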