System: FreeBSD 11.0, Dell PowerEdge 730xd with 256GB of RAM and 140TB of disk on a SAS HBA (not hardware RAID).
We have a number of pretty big file servers that suddenly experience a dramatic slowdown in read() & pread() response time (while reading files in the root filesystem - *not* on the ZFS data disks, which are a different zpool).
These servers carry tens of thousands of filesystems and 100,000 users (whose passwd/group records we store in local DB files under /var/db to speed things up).
Normally things work smoothly and, for example, an "ls -l /export/students" completes in around 2-3 seconds. But every now and then, during the busy hours of the day, it takes minutes instead.
Running "truss -D" on the "ls -l" process shows that read()/pread() calls suddenly take 2-3 ms instead of 0.01-0.02 ms - and thus everything grinds to a virtual halt...
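As a rough back-of-the-envelope of why a few ms per call hurts so much (the call counts and the 0.017 ms healthy average come from the truss summaries below; 2.5 ms is an assumed mid-point of the degraded 2-3 ms figure):

```shell
# Rough arithmetic only - numbers are taken/assumed from the truss
# summaries in this post, not measured here.
awk 'BEGIN {
    calls = 10300 + 6204            # pread + read calls from one "ls -l"
    fast  = calls * 0.017 / 1000    # seconds at ~0.017 ms/call (healthy)
    slow  = calls * 2.5  / 1000    # seconds at ~2.5 ms/call (degraded)
    printf "fast: %.2f s  slow: %.1f s  (%.0fx)\n", fast, slow, slow / fast
}'
```

With those counts a single "ls -l" goes from roughly a quarter of a second of syscall time to 40+ seconds, i.e. a ~150x blowup, which matches the "minutes instead of seconds" behaviour.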
The system "load" number (as seen with "top") doesn't indicate anything extreme. zpool iostat doesn't show anything extreme either.
> cat /tmp/truss.out | tr '(' ' ' | awk '{N[$2]++; T[$2]+=$1} END { for (t in T) { printf "%f s\t%d\t%-30s\t%f ms\n", T[t], N[t], t, T[t]*1000/N[t] }}' | sort -nr
gives for a fast system (Total time, ncalls, syscall, time/call):
0.179025 s 10300 pread 0.017381 ms
0.102503 s 10831 fstat 0.009464 ms
0.096788 s 6204 read 0.015601 ms
and for when it's slow:
19.197472 s 6204 read 3.094370 ms
17.227360 s 10300 pread 1.672559 ms
0.101685 s 10831 fstat 0.009388 ms
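In case anyone wants to sanity-check the aggregation, here is the same pipeline run against a few fabricated lines in the truss -D style it assumes ("<elapsed seconds> syscall(args) = ret"; the sample values are made up):

```shell
# Fabricated sample in the "truss -D" format the one-liner expects.
cat > /tmp/truss.sample <<'EOF'
0.000020 read(3,"...",4096) = 4096
0.000010 fstat(3,{...}) = 0
0.003000 read(3,"...",4096) = 4096
0.000015 pread(4,"...",4096,0) = 4096
EOF

# Same aggregation as above: total time, ncalls, syscall, mean ms/call,
# sorted by total time descending.
tr '(' ' ' < /tmp/truss.sample |
awk '{N[$2]++; T[$2]+=$1}
     END { for (t in T) printf "%f s\t%d\t%-30s\t%f ms\n", T[t], N[t], t, T[t]*1000/N[t] }' |
sort -nr
```

Here read() ends up first with 0.003020 s total over 2 calls, i.e. a 1.51 ms average - one slow call is enough to dominate the mean, which is why the per-call averages above are worth reading alongside the totals.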
We are looking for ideas about where to look for possible causes, and for knobs we could tune.
We tried moving some of the files (our DB passwd/group databases) to /tmp (tmpfs-mounted), but it still takes around the same time.
- Peter