grepping a grep result file using a wildcard is rubbish

kenorb · Dec 13, 2010

Code:

> ll | wc -l
     169
> grep "download" * | grep href > list.txt
load: 1.27  cmd: grep 9946 [wdrain] 681.27r 156.45u 104.06s 37% 1120k
load: 1.25  cmd: grep 9946 [wdrain] 686.78r 157.55u 104.84s 34% 1120k
load: 1.15  cmd: grep 9946 [wdrain] 693.05r 158.74u 105.69s 33% 1120k
^C
> time grep "download" * > list.txt
^C44.297u 32.942s 3:08.01 41.0%	107+1515k 220+106081io 0pf+0w

I've already spent 15 minutes grepping 169 text files (around 30k each) for one word, then cancelled to check what's going on. I've already tried 3-4 times, and during this time I can't use my desktop because all 4 cores are at almost 100%. WTF?!

How to install GNU grep?
UPDATE: I found it.

Code:

> sudo portinstall gnugrep

kenorb · Dec 13, 2010

See:
http://www.mail-archive.com/freebsd-current@freebsd.org/msg124281.html
http://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html

Looks like it's more than 5 times slower?;/
OMG

kenorb · Dec 13, 2010

Code:

> time grep "download" * > list.txt
/usr/local/bin/grep: writing output: No space left on device
/usr/local/bin/grep: writing output: No space left on device
/usr/local/bin/grep: write error
115.987u 121.518s 10:25.48 37.9%	216+1418k 1641+393603io 1pf+0w
-rw-r--r--  1 kenorb  kenorb    48G Dec 13 15:12 list.txt

48G????
Ah, I forgot, it's BSD! Star is an alias for your whole drive, wherever you are. Very intuitive.

wblock@ · Dec 13, 2010

kenorb said:
Code:

> ll | wc -l
     169
> grep "download" * | grep href > list.txt
load: 1.27  cmd: grep 9946 [wdrain] 681.27r 156.45u 104.06s 37% 1120k
load: 1.25  cmd: grep 9946 [wdrain] 686.78r 157.55u 104.84s 34% 1120k
load: 1.15  cmd: grep 9946 [wdrain] 693.05r 158.74u 105.69s 33% 1120k
^C
> time grep "download" * > list.txt
^C44.297u 32.942s 3:08.01 41.0%	107+1515k 220+106081io 0pf+0w

I've already spent 15 minutes grepping 169 text files (around 30k each) for one word, then cancelled to check what's going on. I've already tried 3-4 times, and during this time I can't use my desktop because all 4 cores are at almost 100%. WTF?!

That's ridiculously slow, and obviously broken. Running that grep sequence on /usr/src here (538M) takes only a few seconds the first time, and even less when the files are in cache. wdrain implies something else is wrong. Post an archive of the files and the exact commands used, and I'll test it on my machine.

kenorb · Dec 13, 2010

I don't know how, but this works fine:

Code:

> grep -R "download" * | cat > list.txt

Without the pipe through cat, does FreeBSD assume by default that I want to grep my whole drive, even though I'm in my folder with 129 files?

Or is it one big loop, with grep reading the same file it is appending the matches to?

wblock@ · Dec 13, 2010

kenorb said:

Code:

> time grep "download" * > list.txt
/usr/local/bin/grep: writing output: No space left on device
/usr/local/bin/grep: writing output: No space left on device
/usr/local/bin/grep: write error
115.987u 121.518s 10:25.48 37.9%	216+1418k 1641+393603io 1pf+0w
-rw-r--r--  1 kenorb  kenorb    48G Dec 13 15:12 list.txt

48G????
Ah, I forgot, it's BSD! Star is an alias for your whole drive, wherever you are.

No.
% man -P'less +5/"Filename substitution"' csh

But it sounds like you've proven that grep wasn't at fault.

wblock@ · Dec 13, 2010

kenorb said:
I don't know how, but this works fine:

Code:

> grep -R "download" * | cat > list.txt

Without the pipe through cat, does FreeBSD assume by default that I want to grep my whole drive, even though I'm in my folder with 129 files?

Your grep command is changing (but that -R was in the original problem, wasn't it?). -R (or -r) is a recursive grep. * expands to every file *and* directory in the current directory, and grep searches them all recursively.
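To see what -R changes, here is a minimal sketch in a scratch directory. The names top.txt and sub/nested.txt are made up for illustration; the only assumption about grep is that it accepts -R, which both GNU and BSD grep do.

```shell
#!/bin/sh
# Scratch directory with one plain file and one file inside a subdirectory.
dir=$(mktemp -d)
cd "$dir" || exit 1
mkdir sub
printf 'download here\n' > top.txt
printf 'download there\n' > sub/nested.txt

# Without -R, "*" still expands to "sub top.txt", but grep does not
# descend into sub (it may complain that sub is a directory).
plain=$(grep "download" * 2>/dev/null) || true

# With -R, every directory that "*" expands to is searched recursively.
recursive=$(grep -R "download" *)

echo "plain:"
echo "$plain"
echo "recursive:"
echo "$recursive"
```

Only the recursive run reports the match in sub/nested.txt.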

kenorb · Dec 13, 2010

SUCCESS TEST ON PLAIN FILES:

Code:

perl -e '$i = 0; while($i++ < 100) { system("echo xx test xx > file$i.txt"); }'
> grep test * > zz.txt
> grep test * > zz.txt
> grep test * > zz.txt
> grep test * > zz.txt

No problems at all.

FAIL TEST parsing html files:

Code:

> perl -e '$i = 1; while($i++ < 5) { system("wget -nc \"http://ai-contest.com/rankings.php?page=$i\""); }'
> grep "td" * > list.txt
# WORKS
> grep "td" * > list.txt
# WORKS
> grep "td" * > zz.txt
# BIG FREEZE UNTIL YOU RUN OUT OF SPACE!
load: 0.66  cmd: grep 39619 [biord] 68.52r 34.09u 15.73s 72% 1156k
load: 0.74  cmd: grep 39619 [running] 78.90r 39.41u 18.58s 80% 1156k
load: 0.96  cmd: grep 39619 [running] 118.59r 60.64u 28.60s 82% 1156k
load: 0.75  cmd: grep 39619 [running] 267.90r 122.22u 62.36s 53% 1156k

There is a bug for sure.
I don't know what the difference between list.txt and zz.txt is, but it always freezes with zz.txt and never with list.txt ;/
It always freezes when the output file's name is the last one in alphabetical order.
It works when you grep for "table", but not when you grep for "td". Crazy!

kenorb · Dec 13, 2010

wblock said:
Your grep command is changing (but that -R was in the original problem, wasn't it?). -R (or -r) is a recursive grep. * expands to every file *and* directory in the current directory, and grep searches them all recursively.

I tried -R only once; the rest of the examples are without -R.

wblock@ · Dec 13, 2010

kenorb said:

FAIL TEST parsing html files:

Code:

> perl -e '$i = 1; while($i++ < 5) { system("wget -nc \"http://ai-contest.com/rankings.php?page=$i\""); }'
> grep "td" * > list.txt
# WORKS
> grep "td" * > list.txt
# WORKS
> grep "td" * > zz.txt
# BIG FREEZE UNTIL YOU RUN OUT OF SPACE!

No problem here. Create an empty directory, put just those files in it, and try again.

kenorb · Dec 13, 2010

Trying to debug grep gives some weird output:

Code:

39668: read(3,"xt:zz.txt:zz.txt:zz.txt:zz.txt:z"...,24576) = 24576 (0x6000)
39668: write(1,"zz.txt:zz.txt:zz.txt:zz.txt:zz.t"...,16384) = 16384 (0x4000)
39668: read(3,"t:zz.txt:zz.txt:zz.txt:zz.txt:zz"...,24576) = 24576 (0x6000)
39668: write(1,":zz.txt:zz.txt:zz.txt:zz.txt:zz."...,16384) = 16384 (0x4000)
39668: write(1,"z.txt:zz.txt:zz.txt:zz.txt:zz.tx"...,16384) = 16384 (0x4000)
39668: read(3,"z.txt:zz.txt:zz.txt:zz.txt:zz.tx"...,24576) = 24576 (0x6000)
39668: write(1,"txt:zz.txt:zz.txt:zz.txt:zz.txt:"...,16384) = 16384 (0x4000)
39668: read(3,"t:zz.txt:zz.txt:zz.txt:zz.txt:zz"...,24576) = 24576 (0x6000)
39668: write(1,":zz.txt:zz.txt:zz.txt:zz.txt:zz."...,16384) = 16384 (0x4000)
39668: write(1,"z.txt:zz.txt:zz.txt:zz.txt:zz.tx"...,16384) = 16384 (0x4000)
39668: read(3,":zz.txt:zz.txt:zz.txt:zz.txt:zz."...,24576) = 24576 (0x6000)
39668: write(1,".txt:zz.txt:zz.txt:zz.txt:zz.txt"...,16384) = 16384 (0x4000)
39668: read(3,"xt:zz.txt:zz.txt:zz.txt:zz.txt:z"...,24576) = 24576 (0x6000)
39668: write(1,"xt:zz.txt:zz.txt:zz.txt:zz.txt:z"...,16384) = 16384 (0x4000)

It's a loop bug for sure.

This one:
http://savannah.gnu.org/bugs/?17457
Four years after it was reported, somebody decided that it can't be fixed, LOL!

wblock@ · Dec 13, 2010

Huh. So GNU grep at least does that, where the output file is read as input. I thought it might be that, but couldn't duplicate it. This is more a bug of expectations than anything else. You see the way to avoid this, right? Oh, and are you going to change the title of the thread to something more accurate?

kenorb · Dec 13, 2010

Reported the bug here:
http://www.freebsd.org/cgi/query-pr.cgi?pr=153124

Code:

> mkdir test3 && cd test3
> perl -e '$i = 1; while($i++ < 5) { system("wget -qnc \"http://ai-contest.com/rankings.php?page=$i\""); }'
> time grep "td" * > zz.txt
^T
load: 0.63  cmd: grep 39810 [wdrain] 28.95r 15.76u 6.99s 69% 1176k
load: 0.66  cmd: grep 39810 [running] 35.93r 18.80u 8.63s 68% 1176k
load: 0.66  cmd: grep 39810 [wdrain] 39.71r 20.64u 9.51s 73% 1176k
load: 0.68  cmd: grep 39810 [running] 43.32r 22.53u 10.36s 72% 1176k

Freeze.

On another console:

Code:

> truss -fp `pidof grep`

39810: write(1,".txt:zz.txt:zz.txt:zz.txt:zz.txt"...,16384) = 16384 (0x4000)
39810: write(1,"t:zz.txt:zz.txt:zz.txt:zz.txt:zz"...,16384) = 16384 (0x4000)
39810: read(3,".txt:zz.txt:zz.txt:zz.txt:zz.txt"...,28672) = 28672 (0x7000)
39810: write(1,".txt:zz.txt:zz.txt:zz.txt:zz.txt"...,16384) = 16384 (0x4000)
39810: read(3,"t:zz.txt:zz.txt:zz.txt:zz.txt:zz"...,24576) = 24576 (0x6000)
39810: write(1,"txt:zz.t^C(0x7000)
^C^C^C^C^C^C^C^C^C^Z
Suspended
> sudo killall -9 truss

Code:

> grep --version
grep (GNU grep) 2.5.1-FreeBSD
> uname -a
FreeBSD kenorb 8.1-STABLE FreeBSD 8.1-STABLE #4: Mon Nov 15 14:40:15 GMT 2010     root@kenorb:/usr/obj/usr/src/sys/BRO  amd64

wblock@ · Dec 13, 2010

Put the output file in another directory--not a subdir if you're using -r--so that it isn't read as input, then written as output, then read as input, then written as output, then read as input, then written as output...
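A minimal sketch of that advice, assuming a throwaway layout where the inputs live in one directory and the results in a sibling directory. The directory names in and out and the page*.html files are hypothetical, not from the original report.

```shell
#!/bin/sh
# Hypothetical scratch layout: inputs in $work/in, results in $work/out.
work=$(mktemp -d)
mkdir "$work/in" "$work/out"
printf 'a td b\n' > "$work/in/page1.html"
printf 'no match\n' > "$work/in/page2.html"

cd "$work/in" || exit 1

# The output file is outside the directory that "*" expands in,
# so it is never handed back to grep as an input file.
grep "td" * > "$work/out/list.txt"

cat "$work/out/list.txt"
```

Since out is a sibling of in rather than a subdirectory, the same layout stays safe with -r as well.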

DutchDaemon · Dec 13, 2010

The thread title should now look more informed.

phoenix · Dec 13, 2010

Redirection occurs before shell expansion. Thus, your command is creating the list.txt file first, then it is expanding * to include all the files in the current directory including your output file.

I'm guessing, list.txt is listed alphabetically before any files that match the search string, thus it's empty when grep gets to it, so there's no problem. zz.txt will be listed alphabetically at the end of the list of files, so grep will have written a bunch of lines to it already. Once grep opens it for reading, you get into a loop, since every line matches, so every line is written out to the file, and grep never reaches the end of the file.
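The ordering point can be checked without risking a full disk. This sketch only prints what * expands to and never runs grep against its own output. It assumes both output files already exist, because the exact order of redirection and expansion on a first run can differ between shells. rankings1.html is a stand-in name.

```shell
#!/bin/sh
dir=$(mktemp -d)
cd "$dir" || exit 1

printf 'x td x\n' > rankings1.html   # stand-in input file
: > list.txt                         # output name that sorts before the input
: > zz.txt                           # output name that sorts after the input

# This is the argument list grep would receive from "*".
args=$(echo *)
echo "$args"
```

list.txt comes first in the list, so it would still be empty when read; zz.txt comes last, after matches have already been written to it.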

This is not a grep issue. It's an "I've written a stupid command that does exactly what I tell it to, but that's not what I want, therefore it's a bug" error. More commonly known as PEBKAC.

Re-do your command so that the output file is not in the same directory as your input files, or use a more restrictive wildcard search than just *, or any number of other things that will avoid this issue.
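As one example of a more restrictive wildcard, here is a sketch that assumes the downloaded pages are named the way wget would save them from the URLs earlier in the thread (rankings.php?page=N). The file contents are made up.

```shell
#!/bin/sh
dir=$(mktemp -d)
cd "$dir" || exit 1

# Assumed file names; the contents are invented for illustration.
printf '<td>1</td>\n' > 'rankings.php?page=2'
printf '<td>2</td>\n' > 'rankings.php?page=3'

# The pattern matches only the downloaded pages, never zz.txt,
# so repeating the command gives the same result each time.
grep "td" rankings.php* > zz.txt
grep "td" rankings.php* > zz.txt

cat zz.txt
```

The output file can stay in the same directory here because the pattern no longer matches its name.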

Reading the man page for your shell of choice would also be helpful. This is covered in there.

Oh, and you can close your PR. It's not a bug in grep.

DutchDaemon · Dec 14, 2010

And I'm closing this thread, because it is a bug. And rubbish
