Solved Weird behaviour with piping STDOUT to /dev/null

ironudjin · Mar 7, 2023

Hello,

I'm experimenting with data filtering using some console programs and found weird behaviour when redirecting STDOUT of GNU grep to /dev/null.

The file test is 10 GB of data (an Apache log).

Code:

$ /usr/bin/time -h ggrep 'GET' test > /dev/null
    0,00s real        0,00s user        0,00s sys

It seems ggrep outputs nothing, as if it didn't even start processing the data, and exits without any error message.
But if I pipe the output to cat and then to /dev/null, everything works fine:

Code:

$ /usr/bin/time -h ggrep 'GET' test | cat > /dev/null
    24,40s real        10,70s user        13,45s sys

It also writes to a file without any problem:

Code:

$ /usr/bin/time -h ggrep 'GET' test > ~/test.output
    22,80s real        10,97s user        11,37s sys

The same problem with textproc/ugrep:

Code:

$ /usr/bin/time -h ugrep 'GET' test > /dev/null
    0,00s real        0,00s user        0,00s sys

There is no such problem with, for example, BSD grep or GNU Awk:

Code:

$ /usr/bin/time -h gawk '/GET/' test > /dev/null
    21,78s real        13,41s user        8,36s sys

My shell is tcsh, but the result is the same with sh, bash and csh.

Does anyone have an idea why this happens?

Thank you!

zirias@ · Mar 7, 2023

See https://git.savannah.gnu.org/cgit/grep.git/tree/src/grep.c#n2875

So, working as designed. GNU grep detects whether stdout is /dev/null and avoids any work that's only needed for output on stdout in that case.
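The same kind of check can be tried from the shell. This is only a minimal sketch of the idea, not grep's actual code: stdout_is_dev_null is a made-up helper name, and it assumes /dev/stdout exists and refers to the current stdout, and that your shell's test supports the -ef operator (same device and inode).

```shell
#!/bin/sh
# stdout_is_dev_null is a hypothetical helper, not part of grep.
# It succeeds when the current stdout is the same device node as /dev/null.
stdout_is_dev_null() {
    # -ef is true when both paths resolve to the same device and inode.
    # Assumption: /dev/stdout refers to this process's stdout.
    [ /dev/stdout -ef /dev/null ]
}

# Report on stderr, because stdout may be discarded.
if stdout_is_dev_null; then
    echo "stdout is /dev/null" >&2
else
    echo "stdout is something else" >&2
fi
```

Saved as a script (say check.sh, a hypothetical name), running it with > /dev/null should report the first case, and running it with | cat should report the second.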

ironudjin · Mar 7, 2023

Oh... I thought the reason was somewhere outside ggrep. Thank you.

Is there a way to cheat it without editing the source? I tried:

$ ln -s /dev/null ~/null
$ ggrep 'google' test >> ~/null
        0,00 real         0,00 user         0,00 sys

It seems this piece of code doesn't allow that:

C:

         if (stat ("/dev/null", &null_stat) == 0
              && SAME_INODE (tmp_stat, null_stat))

I'm trying to evaluate and compare the text filtering speed of different console tools.

zirias@ · Mar 7, 2023

I guess when trying to do that, you'll have a lot of influencing factors. Just adding | cat >/dev/null to all of them should probably be acceptable.
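A small wrapper keeps the pipeline identical for every tool. This is a sketch only: bench and TIME_CMD are made-up names, and the default timer is the /usr/bin/time -h invocation used earlier in this thread.

```shell
#!/bin/sh
# TIME_CMD is a hypothetical knob: the timer to run in front of each tool.
# The default is the command used in this thread; set it to an empty
# string to run the tool without timing.
TIME_CMD="${TIME_CMD-/usr/bin/time -h}"

# bench FILE CMD [ARGS...]  (hypothetical helper)
# Runs CMD ARGS FILE with stdout connected to a pipe that cat drains into
# /dev/null, so CMD itself never sees /dev/null as its stdout.
bench() {
    file=$1
    shift
    # $TIME_CMD is left unquoted on purpose so it splits into words.
    $TIME_CMD "$@" "$file" | cat > /dev/null
}

# Usage with the tools and file name from this thread:
#   bench test ggrep 'GET'
#   bench test ugrep 'GET'
#   bench test gawk '/GET/'
```

As in the commands above, only the tool is timed, not cat, and the timer's report goes to stderr.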

ironudjin · Mar 7, 2023

With and without cat, the results differ quite a bit:

$ /usr/bin/time -h gawk '/GET/' test > /dev/null
    16,34s real        13,69s user        2,64s sys
$ /usr/bin/time -h gawk '/GET/' test | cat > /dev/null
    20,81s real        13,46s user        7,33s sys

zirias@ · Mar 7, 2023

Of course they are. But as you can see not for the "user" part, which makes sense as cat doesn't really do anything interesting, just shoveling input to output. So, what's added is mostly the kernel work for the pipe (which will of course increase with the amount of output created).

ralphbsz · Mar 7, 2023

The fact that grep doesn't use much CPU time, and probably doesn't read much of the input makes perfect sense, if the search string is common. For fun, you could try to search for a string that is either extremely uncommon or doesn't even occur in the input file (not "GET", but "QwErTiOp"), and now it should use a lot of time and CPU to read all of the input.

It makes perfect sense that programs such as grep suppress creating output if they detect that the output will be ignored. And testing whether the output file is identical to (has the same inode as) /dev/null is a halfway decent test for that. But they still need to return in their status code whether the string was found. With a search string such as "GET" on an Apache log, that probably happens within the first few lines, so then they can abandon reading and processing.

The use case of "fgrep XXX YYY > /dev/null" is actually quite common; it is often used in scripts to test whether a string occurs in a file or stream, and optimizing for it is a good investment.
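For illustration, a script using that pattern might look like the sketch below. The helper name contains and the two sample log lines are made up; grep -F is the fixed-string search that fgrep performs.

```shell
#!/bin/sh
# contains STRING FILE  (hypothetical helper)
# Succeeds if some line of FILE contains STRING as a fixed string.
contains() {
    # Output is discarded; only grep's exit status is used.
    grep -F -- "$1" "$2" > /dev/null
}

# Made-up sample data standing in for a real log file.
log=$(mktemp)
printf '%s\n' 'GET /index.html' 'POST /login' > "$log"

if contains 'GET' "$log"; then
    echo "found GET"
fi
if ! contains 'QwErTiOp' "$log"; then
    echo "no QwErTiOp"
fi

rm -f "$log"
```

Only the exit status is used here, so a grep that stops at the first match gives the same answer with less work.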
