Here’s a little [cmd=]find[/cmd] trick that not many people seem to know:
Code:
# 13 seconds...
$ time find . -type f -exec stat {} \; > /dev/null
13.20s real 3.94s user 9.22s sys
# 1.5 seconds! That's roughly 9 times faster!
$ time find . -type f -exec stat {} + > /dev/null
1.48s real 0.68s user 0.79s sys
# Run the first command again, to rule out filesystem cache effects
# or a fluke
$ time find . -type f -exec stat {} \; > /dev/null
13.40s real 3.67s user 9.51s sys
# FYI, the number of files:
$ find . -type f | wc -l
2641
That’s quite a large difference! All we did was swap the [cmd=];[/cmd] for a [cmd=]+[/cmd].
Let’s see what POSIX has to say about it (emphases mine):
If the primary expression is punctuated by a [cmd=]<semicolon>[/cmd], the utility [cmd=]utility_name[/cmd] shall be invoked once for each pathname
[.. snip ..]
If the primary expression is punctuated by a [cmd=]<plus-sign>[/cmd], the primary shall always evaluate as true, and the pathnames for which the primary is evaluated shall be aggregated into sets. The utility [cmd=]utility_name[/cmd] shall be invoked once for each set of aggregated pathnames.
Or in slightly more normal English: If you use [cmd=];[/cmd], [cmd=]find[/cmd] will execute the utility once for every path; if you use [cmd=]+[/cmd], it will cram as many paths as it can into each invocation.
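To see the difference without timing anything, here is a minimal sketch that counts invocations directly. The scratch directory, the file names and the variable names are made up for illustration. The inner [cmd=]sh -c 'echo run'[/cmd] prints one line each time it is started, so counting lines counts invocations.

```shell
# Scratch directory with five empty files (names are arbitrary).
tmp=$(mktemp -d "${TMPDIR:-/tmp}/findexec.XXXXXX")
for i in 1 2 3 4 5; do : > "$tmp/file$i"; done

# With ";" the utility is started once per file.
per_file=$(find "$tmp" -type f -exec sh -c 'echo run' sh {} \; | wc -l | tr -d ' ')

# With "+" the paths are batched; five short paths fit in one batch.
batched=$(find "$tmp" -type f -exec sh -c 'echo run' sh {} + | wc -l | tr -d ' ')

echo "with ';': $per_file invocations"
echo "with '+': $batched invocations"

rm -rf "$tmp"
```

With five files this should report 5 invocations for [cmd=];[/cmd] and 1 for [cmd=]+[/cmd].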
How many? Well, as many as [cmd=]ARG_MAX[/cmd] allows. Quoting from POSIX again:
[cmd=]{ARG_MAX}[/cmd]
Maximum length of argument to the exec functions including environment data.
Minimum Acceptable Value: [cmd=]{_POSIX_ARG_MAX}[/cmd]
[cmd=]{_POSIX_ARG_MAX}[/cmd]
Maximum length of argument to the exec functions including environment data.
Value: 4096
Most contemporary systems have it set much higher though; Linux (3.16, x86_64) defines [cmd=]ARG_MAX[/cmd] as 131072 (128k), while FreeBSD (10, i386) gives it as 262144 (256k).
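If you want to check the limit on your own machine, [cmd=]getconf[/cmd] can report it. This is only a quick sketch; the variable name is arbitrary, and on some systems the value reported at run time may differ from the constant in the system headers.

```shell
# Ask the running system for its limit, in bytes.
arg_max=$(getconf ARG_MAX)
echo "ARG_MAX on this system: $arg_max bytes"

# POSIX requires at least _POSIX_ARG_MAX, which is 4096.
if [ "$arg_max" -ge 4096 ]; then
    echo "meets the POSIX minimum"
fi
```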
Let’s verify this with [cmd=]truss[/cmd][^1]:
Code:
# Number of files we have
$ find . -type f | wc -l
2641
$ truss find . -type f -exec stat {} \; >& truss-slow
$ truss find . -type f -exec stat {} + >& truss-fast
# Less than ARG_MAX, so we expect one fork()
$ find . -type f | xargs | wc -c
119528
# Yup!
$ grep fork truss-fast | wc -l
1
# And we fork() once for every file
$ grep fork truss-slow | wc -l
2641
Caveat
There is one small caveat. This won’t work:
Code:
# FreeBSD find
$ find . -type f -exec cp {} /tmp +
find: -exec: no terminating ";" or "+"
# GNU find is even more cryptic:
$ find . -type f -exec cp {} /tmp +
find: missing argument to `-exec'
Going back to POSIX:
Only a [cmd=]<plus-sign>[/cmd] that immediately follows an argument containing only the two characters [cmd=]"{}"[/cmd] shall punctuate the end of the primary expression. Other uses of the [cmd=]<plus-sign>[/cmd] shall not be treated as special.
In other words, the command needs to end with [cmd=]{} +[/cmd]. [cmd=]cp {} /tmp +[/cmd] doesn’t, and thus gives an error.
We can work around this by spawning a [cmd=]sh[/cmd] one-liner:
Code:
# The extra "sh" becomes $0; without it the first path would be swallowed
$ find . -type f -exec sh -c 'cp "$@" /tmp' sh {} +
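One detail of the [cmd=]sh -c[/cmd] form is worth checking for yourself: the first argument after the script string becomes [cmd=]$0[/cmd], not part of [cmd=]"$@"[/cmd]. The sketch below copies three files twice, once without and once with a placeholder word in that slot. The directory names and the [cmd=]DEST[/cmd] variable are invented for this example.

```shell
# Scratch tree: three source files and two empty target directories.
tmp=$(mktemp -d "${TMPDIR:-/tmp}/findexec.XXXXXX")
mkdir "$tmp/src" "$tmp/without" "$tmp/with"
for i in 1 2 3; do : > "$tmp/src/file$i"; done

# No placeholder: the first path lands in $0 and is never passed to cp.
DEST="$tmp/without"; export DEST
find "$tmp/src" -type f -exec sh -c 'cp "$@" "$DEST"' {} +

# Placeholder "sh" fills $0, so every path ends up in "$@".
DEST="$tmp/with"; export DEST
find "$tmp/src" -type f -exec sh -c 'cp "$@" "$DEST"' sh {} +

copied_without=$(ls "$tmp/without" | wc -l | tr -d ' ')
copied_with=$(ls "$tmp/with" | wc -l | tr -d ' ')
echo "without placeholder: $copied_without of 3 files copied"
echo "with placeholder:    $copied_with of 3 files copied"

rm -rf "$tmp"
```

Without the placeholder only 2 of the 3 files arrive; with it, all 3 do.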