Unfortunately I can't give figures as the site is still in development and I have no real world data at this time.
And that's the crux of the problem: I can't answer your question, because we need data.
How high is the required bandwidth going to be? A typical YouTube video uses about 2 Mbit/s, or about 1/4 MByte/s. Audio is much lower. Say you have 10 video streams being served at the same time: that's about 2.5 MByte/s.
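The arithmetic above can be checked in a couple of lines. The 2 Mbit/s per-stream rate and the count of 10 streams are the assumptions from the text, not measurements:

```python
# Back-of-envelope bandwidth estimate (assumed numbers, not measured).
stream_mbit_per_s = 2     # assumed typical YouTube video bitrate
streams = 10              # assumed number of concurrent viewers

per_stream_mbyte_per_s = stream_mbit_per_s / 8       # 8 bits per byte
aggregate_mbyte_per_s = streams * per_stream_mbyte_per_s
print(per_stream_mbyte_per_s, aggregate_mbyte_per_s)  # 0.25 2.5
```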
Let's assume your "large ZFS server" has half a dozen disks, and a single gigabit ethernet port. Let's assume your web server (where NGINX runs) also has a gigabit port for the internal side. The disks can run at many hundreds of MByte/s. The ZFS server's CPU can keep up with ZFS easily, and can keep up with an NFS server also running on the machine. In this scenario, the bottleneck is likely to be the ethernet connectivity (about 100 MByte/s), which gives you a factor of 40 safety margin compared to your needs. That scenario works even with no caching at all!
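The factor-of-40 margin follows directly from these assumed numbers (roughly 100 MByte/s usable on gigabit ethernet versus a 2.5 MByte/s workload):

```python
# Safety margin of the assumed gigabit link over the assumed workload.
link_mbyte_per_s = 100        # rough usable throughput of one gigabit port
workload_mbyte_per_s = 2.5    # 10 streams at ~0.25 MByte/s each

safety_margin = link_mbyte_per_s / workload_mbyte_per_s
print(safety_margin)  # 40.0
```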
BUT: the above rests on a lot of assumptions: the workload (serving YouTube 720p-quality video), the scale (10 streams in parallel), and the hardware (half a dozen disks, a dedicated modern CPU for the file server, gigabit ethernet). Any of these assumptions may be off by a factor of 2, 3 or 10. If your workload is more intense by a factor of 10, or your hardware is weaker (for example a single disk), then the above won't work.
Once you get close to saturating a bottleneck, other things start to matter a lot. For example caching, which relies on locality. Now we suddenly need to know the footprint of your workload. Say your web server has to serve 1000 streams at the same time (outgoing bandwidth 250 MByte/s), but at any given moment it serves only a very small set of content (all your viewers really just want to see the most recent Taylor Swift video), and that working set fits into 20 GB of RAM. Then any mechanism that allows caching at the video server is viable, and neither the network (protocol or hardware) nor the file server will be the bottleneck. On the other hand, if the streams being served are completely uncorrelated (each of the 1000 viewers is watching a different video), the bottleneck moves to the file server.
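A toy model of that locality argument, with all numbers taken as illustrative assumptions from the scenario above (1000 streams at ~0.25 MByte/s, a 20 GB working set, and a hypothetical cache size):

```python
# Toy model: where does the bottleneck land, given a cache of a fixed size?
# All figures are illustrative assumptions, not measurements.
def likely_bottleneck(num_streams, mbyte_per_stream, working_set_gb, cache_ram_gb):
    outgoing = num_streams * mbyte_per_stream  # MByte/s leaving the web server
    if working_set_gb <= cache_ram_gb:
        # Hot content fits in RAM: the file server is rarely touched.
        return outgoing, "web server outgoing network"
    # Uncorrelated requests miss the cache and hit the file server.
    return outgoing, "file server"

# Correlated viewers (20 GB working set, 32 GB of cache RAM assumed):
print(likely_bottleneck(1000, 0.25, 20, 32))
# Uncorrelated viewers (working set far larger than cache):
print(likely_bottleneck(1000, 0.25, 4000, 32))
```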
Funny anecdote: About 15 years ago, I worked on a project that was an integrated compute+storage server. It was very powerful for its time, capable of serving thousands of video streams, while being hyper-efficient and requiring neither air cooling nor water cooling with exposed water. For our demo, we used to serve 1K video streams simultaneously ... except we only had one video to serve, the chariot race scene from "Ben Hur", which is less than 5 minutes long.