"yes | pv > /dev/null" performance issue on FreeBSD in C-code

I ran yes | pv > /dev/null on Linux and FreeBSD

Results:
Linux - [4.06GiB/s]
FreeBSD - [24.8MiB/s]

Any idea?
This experiment came from a yes command written in Nim.

So I tried the Nim code on FreeBSD:
Nim - [2.14GiB/s]

It is worse than I expected. It doesn't matter much for this particular program, but if every single component is written the same way,
then we all pay a performance penalty for nothing. Just imagine running 100x slower than what the same hardware could do.
 
For testing things, I use a FreeBSD 11.1 installation in VirtualBox on a Mac mini, so please don't expect top performance. Anyway, here is blockyes.c:
C:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

#define block_size (1024*1024)

int main(int argc, const char *argv[])
{
   /* Allocate one block and fill it with 'y'. */
   char *ys = malloc(block_size);
   if (ys == NULL)
      return 1;
   memset(ys, 'y', block_size);
   for (;;)
      fwrite(ys, block_size, 1, stdout);
   return 0;
}
clang -g0 -O3 blockyes.c -o blockyes
./blockyes | pv > /dev/null
60GiB 0:00:10 [ 463MiB/s]

With the yes command I also saw only about 20 MiB/s.
 
In the meantime I deleted my version, because I used fwrite(3) instead of write(2); here comes the corrected one:
blockyes.c
C:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define block_size (1024*1024)

int main(int argc, const char *argv[])
{
   /* Allocate one block and fill it with 'y'. */
   char *ys = malloc(block_size);
   if (ys == NULL)
      return 1;
   memset(ys, 'y', block_size);
   for (;;)
      write(1, ys, block_size);
   return 0;
}

clang -g0 -O3 blockyes.c -o blockyes
./blockyes | pv > /dev/null

4GiB 0:00:07 [1.64GiB/s]

3.5 times the performance of the fwrite() version with FreeBSD 11.1 in VirtualBox on my Mac mini.
 
Thank you for addressing it!

My first thought was that I had made a mistake while recompiling world and kernel on my particular machine.
Then I understood that the issue is in the code, though it doesn't really matter.

As I added later, this may show up everywhere that buffered and aligned output outperforms a simple, naive solution
by a factor of 100. I guess that is a bit too much to ignore.

The yes case itself is trivial and not of high importance.

P.S. It's quite remarkable that tricks like this can gain more than any compiler can optimize.
That is probably the reason why ASM will remain the fastest language on earth.
 
3.5 times the performance of the fwrite() version with FreeBSD 11.1 in VirtualBox on my Mac mini.

Thank you for your work!

I've made the same observation on Linux. If you use the standard stdio.h functions for dealing with files, your speed falls far behind
that of any system utility written in the same C language. Once you use kernel calls like write() directly, you get decent performance.

Why the hell were all these snail functions written in the first place? And what madman put them into the standard libc?

99% of people don't go that far just to open and read a file. Instead they use Python and get slow, interpreted code...
 
Why the hell were all these snail functions written in the first place? And what madman put them into the standard libc?
What is the most precious resource today on computers?

Is it CPU time? No. Otherwise there would be no software written in Java or Python, which do run slower than the best-optimized C code. Let me tell you a secret: if you look at very large systems, like "big data" analytics clusters, which usually cost tens of millions of dollars, their software is written in ... Java and Python. As a matter of fact, look at Hadoop sometime, and look at the overall efficiency of a Hadoop cluster: it's awful! The people who use this system could save millions of dollars just by optimizing their software better.

Why? Because the real precious resource today is not CPUs, but brains. The time of software engineers is much more precious today. For most applications, cleanly written software, concise programs, programs that are short and easy to debug are more important than a performance gain in an unusual corner case: The program yes is not commonly used in high-performance pipelines; a typical use is "yes | rm -i ...", which is performance-limited in the file system or the rm program, not in yes. Adding 10 lines of code to yes to make it faster is a very bad investment, if it might cause it to have a bug, or be harder to work on in the future.

And this is why slower high-level routines like fwrite() in C exist, and programming languages that are by their nature not so CPU efficient. Shorter programs tend to have fewer bugs. High-level library routines tend to have very few bugs (they tend to be tested really well by many people), so using them makes for better code, code that is more reliable, and code that can be written more quickly. In most cases, we don't care that it runs a little more slowly.

Was it Dijkstra who said the following (it is usually attributed to Hoare)? "There are two kinds of programs: those that are so short that they obviously have no bugs, and those that are so long that they have no obvious bugs."

Now, you might claim that yes is so short that it cannot possibly have a bug. And I will refute that argument by telling you the story of the program IEFBR14, which used to consist of a single assembly instruction: "BR 14". It turns out that even though that program was very simple, it had a bug, which was only found after many years of use (on what was then some of the largest computers on the planet): when used in a "pipeline" (a conditional sequence of job steps), it could sometimes cause the next program to not run. So a second instruction had to be added.
 
There are many different types of “files” you might be operating on (pipes to other processes, a socket to a remote system, a file on a filesystem mounted with ‘sync’ and backed by a slow HDD, a tape drive, a file on a ram disk, etc.), and you might be trying to write small log files a few bytes at a time or gigantic media files as fast as your little legs can run.

The point is, there is no one-size-fits-all best practice or interface. The buffered I/O of libc can outperform (depending on your metric of “perform”) raw write/read calls for some I/O patterns and “files.” And, as shown, the raw interfaces can perform better in other cases.

It is clear that the current-till-reddit-shaming “yes” code (intended to provide “y” inputs to a simple interactive program) was not designed to “write ‘y’ as fast as possible” because that’s rather silly. It was certainly more succinct and easier to visually inspect for correctness, but apparently we’ve got our rulers out.

N.B. that the “buffering” that is going on in the optimized yes is not quite the same as traditional file buffering, because here we know a priori what it is we want to write (over and over again) while libc buffering doesn’t know what will be coming at it, how large it will eventually be, how soon the next bytes will be coming, etc.
 
Actually, I wonder whether an "optimized" buffered version of yes may actually have a bug: it tries to write a huge amount (obsigna's example above is 1 MiB) at a time. If the output of yes is piped through something like netcat, over a TCP link, I wonder if that might outright fail, or cause a slowdown as the block has to be broken up to whatever limit the socket library has.

The whole "reddit shaming" thing is pretty idiotic to begin with. Anyone who thinks that the quality of an operating system is measured by how fast yes outputs the "y" character is an idiot, and they get the operating system they deserve. I would even claim that for 99% of all computer users, performance of modern operating systems is indistinguishable, and more than good enough, and that other factors are much more important in selecting your OS (the other 1% probably work in classified or highly commercial environments, and won't post here).
 
Is it CPU time? No. Otherwise there would be no software written in Java or Python, which do run slower than the best-optimized C code.

I keep hearing this statement again and again. It's right in theory, but do you really think you could write something called best-optimized C code? I don't think so. ASM people simply moved to C and changed "best-optimized ASM" to "best-optimized C", nothing more.
 
Actually, I wonder whether an "optimized" buffered version of yes may actually have a bug: it tries to write a huge amount (obsigna's example above is 1 MiB) at a time. If the output of yes is piped through something like netcat, over a TCP link, I wonder if that might outright fail, or cause a slowdown as the block has to be broken up to whatever limit the socket library has.
The write(2) function takes care of this (it may accept fewer bytes than requested and report the count, so a robust program would check the return value). Actually, if I set the buffer size to 1 in my example, it is still 20 times faster than yes(1). The code of yes on FreeBSD 11.1 is in /usr/src/usr.bin/yes/yes.c:
C:
/*
* Copyright (c) 1987, 1993
*    The Regents of the University of California.  All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
* 1. Redistributions of source code must retain the above copyright
*    notice, this list of conditions and the following disclaimer.
* 2. Redistributions in binary form must reproduce the above copyright
*    notice, this list of conditions and the following disclaimer in the
*    documentation and/or other materials provided with the distribution.
* 4. Neither the name of the University nor the names of its contributors
*    may be used to endorse or promote products derived from this software
*    without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
* ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
* OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
* HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
* LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
* OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
* SUCH DAMAGE.
*/

#ifndef lint
static const char copyright[] =
"@(#) Copyright (c) 1987, 1993\n\
    The Regents of the University of California.  All rights reserved.\n";
#endif /* not lint */

#ifndef lint
#if 0
static char sccsid[] = "@(#)yes.c    8.1 (Berkeley) 6/6/93";
#else
static const char rcsid[] = "$FreeBSD: releng/11.1/usr.bin/yes/yes.c 216370 2010-12-11 08:32:16Z joel $";
#endif
#endif /* not lint */

#include <err.h>
#include <stdio.h>

int
main(int argc, char **argv)
{
    if (argc > 1)
        while (puts(argv[1]) != EOF)
            ;
    else
        while (puts("y") != EOF)
            ;
    err(1, "stdout");
    /*NOTREACHED*/
}

To start with, it does not repeatedly output "y", as stated in the manual, but "y\n". So either the manual has a bug or yes must not use puts(3). The bottleneck within puts is that it locks and unlocks the output stream for every call, presumably to guarantee that the written string cannot be interleaved with strings from another thread of the same process. That seems at least questionable for a single-threaded command which is usually supposed to output single chars.

The whole "reddit shaming" thing is pretty idiotic to begin with. Anyone who thinks that the quality of an operating system is measured by how fast yes outputs the "y" character is an idiot, and they get the operating system they deserve. I would even claim that for 99% of all computer users, performance of modern operating systems is indistinguishable, and more than good enough, and that other factors are much more important in selecting your OS (the other 1% probably work in classified or highly commercial environments, and won't post here).
Most of the discussion on reddit seems to be simply fuss, and I won't comment on hot air. When I responded, I hadn't even read anything on reddit. I was only curious where the huge speed difference might come from, and IMHO it is the inappropriate use of puts() instead of write(). Anyway, I didn't and still don't claim at all that my test snippet would be a suitable yes replacement. BTW, my snippet showed that, with respect to speed, saturation is reached with a 128 KB buffer.
 