belon_cfy said:
4-disk server with 1 SSD as cache and log device, over a 1 Gbps link.
Educated guess: you are running ZFS with a 3 data + 1 parity array (a.k.a. raidz1) and a cache/log SSD. First question: given your hardware, how fast is the local file system? Simple test: find a big file (gigabytes) and read it sequentially with dd to /dev/null. Then repeat the test with 16 or 128 copies of dd, each reading a different large file, and add up the results. If you get less than 100 MB/s (which would be pretty slow), then the local file system will be the bottleneck. This will also give you a hint of the sweet spot for the number of threads.
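The sequential test above can be sketched like this. The paths are placeholders; this demo writes its own throwaway file so the commands run anywhere, but for a real measurement you would point dd at an existing multi-gigabyte file that is not already cached:

```shell
# Sequential-read probe of the local file system (no NFS involved).
# The file here is a stand-in: on a real server, use an existing
# multi-GB file that is NOT already sitting in the ARC/page cache.
TESTDIR=$(mktemp -d)
dd if=/dev/zero of="$TESTDIR/bigfile" bs=1M count=256 2>/dev/null

# Single stream; dd reports throughput on its last stderr line.
dd if="$TESTDIR/bigfile" of=/dev/null bs=1M 2>&1 | tail -n 1

# Parallel streams: one dd per reader in the background, then add up
# the per-stream MB/s figures by hand. (Each reader should get a
# DIFFERENT large file on a real server; here they share the demo file.)
for i in 1 2 3 4; do
    dd if="$TESTDIR/bigfile" of=/dev/null bs=1M 2>"$TESTDIR/dd.$i.log" &
done
wait
tail -q -n 1 "$TESTDIR"/dd.*.log
rm -rf "$TESTDIR"
```

Remember that the second pass will likely be served from cache on a demo this small; real files larger than RAM avoid that.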
I'm not claiming that sequential reads of large files are representative of your workload, and multiple parallel reads of multiple large files are even less likely to resemble the real world. Think of this as a rough estimate of the high-water mark: performance that is unlikely to be exceeded.
If you want to do a small-IO test, an easy (but pretty inaccurate) way is to use find to walk a directory hierarchy that has lots of small files (source trees tend to be good), and read each file that find turns up (you can do that by piping find into xargs running dd). Again, test against the local file system first, without NFS in between. Before we proceed to NFS, we want to know what the system is capable of without network and protocol overhead.
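A sketch of that small-IO walk. The source-tree path is hypothetical; the demo builds a tiny throwaway tree so the pipeline is runnable as-is:

```shell
# Small-IO probe: walk a directory tree and read every file in it.
# On a real server, point SRCDIR at an existing tree with many small
# files (a source tree works well) and time the whole pass.
SRCDIR=$(mktemp -d)                       # stand-in for a real tree
for i in $(seq 1 50); do echo "file $i" > "$SRCDIR/f$i"; done

time find "$SRCDIR" -type f -print0 \
    | xargs -0 -I{} dd if={} of=/dev/null bs=64k 2>/dev/null

rm -rf "$SRCDIR"
```

Files read divided by elapsed seconds gives you a rough small-IOPS figure.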
Ideally, you should be able to sustain a sequential throughput of about 400-500 MB/s (roughly 100 MB/s per disk drive), and about 400 small IOs per second (roughly 100 IOPS per drive), ignoring cache effects. The reality is probably considerably lower.
Will it be any different with 16 NFS threads compared to 128?
Only a test with your real-world workload will tell for sure. If you knew exactly what your workload is (video editing, music listening, transaction-processing databases, compiles of big programs), you could create a synthetic benchmark (a few simple examples are described above) and run it on the NFS clients. Then adjust the number of NFS server threads while measuring the throughput of your benchmark.
Realistically, you won't know very much about your workload (few people do), and even fewer will be able to reproduce it synthetically. In that case, I suggest one of two approaches. While your normal production workload is running, start a small synthetic test case (read large files or directory hierarchies of small files, or something more realistic), and measure its performance as you adjust the thread count. Or, if you can actually measure the throughput of the real applications (how long does it take to, say, edit a video clip or compile a program), look at that while you adjust the thread count.
Then pick the value of thread count that works best for your situation.
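The tune-and-measure loop could look something like this. Assumptions: a Linux NFS server, where rpc.nfsd sets the thread count (on Solaris you'd use sharectl or NFSD_SERVERS instead); run_benchmark is a placeholder for your own measurement, e.g. the dd or find/xargs tests above executed from an NFS client:

```shell
# Tuning-loop sketch (Linux nfsd assumed; needs root on a real server).
set_nfsd_threads() {
    if command -v rpc.nfsd >/dev/null 2>&1; then
        rpc.nfsd "$1"                          # set server thread count
    else
        echo "would set nfsd threads to $1"    # dry run off-server
    fi
}

run_benchmark() {
    # Placeholder workload: substitute your real client-side test here.
    dd if=/etc/hosts of=/dev/null bs=4k 2>/dev/null
}

for n in 16 32 64 128; do
    set_nfsd_threads "$n"
    t0=$(date +%s)
    run_benchmark
    t1=$(date +%s)
    echo "threads=$n elapsed=$((t1 - t0))s"
done
```

Let the server settle for a while between changes before trusting a measurement; a freshly changed thread count takes a moment to show its effect on queues and caches.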
Having said all that, my (wild-posterior-) guess would be the following: the ZFS cache/log SSD will probably absorb a good fraction of the NFS requests, with writes hopefully getting destaged during pauses in the bursty workload. Let's guess that half of the NFS requests go to the SSD, and that it can handle those IOs (it should, as its IOPS for both read and write should be good enough to cover 4 disks). Then you want a dozen or two dozen IOs outstanding per hard disk. If we arbitrarily pick 16 IOs per disk drive, times 4 drives, times 2 (for the half of the IOs being handled by the cache/log SSD), then we arrive at the 128 threads you proposed. Run with that for a while, and if the system is stable and the server doesn't seem overloaded, it's at least a decent starting point.
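The back-of-envelope estimate above, as arithmetic (all three factors are the guesses from the paragraph, not measured values):

```shell
# Thread-count estimate: assumed queue depth per spindle, number of
# spindles, and a factor of 2 because roughly half the IOs are
# expected to be absorbed by the SSD cache/log device.
QUEUE_PER_DISK=16
DISKS=4
CACHE_FACTOR=2
echo "suggested nfsd threads: $((QUEUE_PER_DISK * DISKS * CACHE_FACTOR))"
# prints: suggested nfsd threads: 128
```

If your measurements suggest a different queue depth or cache hit rate, plug those in instead.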