fgrep excessive time increase after upgrade from 12.4 to 14.3 - regex library issue?

This is a weird one. Since upgrading FreeBSD (jumping a couple of major versions), a script that used to run in less than a minute now takes literally hours. I've managed to narrow the delay down to this use of fgrep:

cat filename.txt | fgrep --text -if filter.txt

- filename.txt contains 64 million lines, 2.4 GB
- filter.txt contains 615 lines, 6 KB

fgrep CPU usage is at 100% consistently.

I do have a backup of the 12.4 system, and I did notice one possible explanation:

fgrep 12.4: libgnuregex.so.5
fgrep 14.3: libregex.so.1
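
One way to check which regex library a given grep binary links against is ldd. This is only a sketch: regex_libs is a made-up helper name, and on systems where grep does not link a separate regex library it prints a fallback message instead.

```shell
# Hypothetical helper: list the regex-related shared libraries a tool links against.
regex_libs() {
    # ldd prints one line per shared library; keep only the regex ones.
    ldd "$(command -v "$1")" 2>/dev/null | grep -i regex \
        || echo "no separate regex library listed for $1"
}

regex_libs grep
```

Running it on both the old and the new system (or against a binary restored from backup) should show the libgnuregex versus libregex difference listed above.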

I cannot find any information about this change, or anyone else mentioning a regression. Hope someone has some insight. Thanks.
 
Yes, this is normal, unfortunately. I think it might be included in the FreeBSD Foundation's SIMD efforts for string functions.

There is a port for GNU grep.
 
-i sucks in the bsd lib
because instead of doing a normal case-insensitive comparison it expands the search text from fubarbaz to [fF][uU][bB][aA][rR][bB][aA][zZ]
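
To make that expansion concrete, here is a small sketch that rewrites a word into per-character bracket sets. expand_icase is a hypothetical helper written for illustration only; it is not how libregex does it internally, it handles ASCII letters only, and it does not escape regex metacharacters.

```shell
# Hypothetical helper: turn each ASCII letter into a two-member bracket set.
expand_icase() {
    printf '%s\n' "$1" | awk '{
        out = ""
        for (i = 1; i <= length($0); i++) {
            c = substr($0, i, 1)
            if (c ~ /[A-Za-z]/)
                out = out "[" tolower(c) toupper(c) "]"
            else
                out = out c        # non-letters are copied through unchanged
        }
        print out
    }'
}

expand_icase fubarbaz

# The expanded pattern matches the same lines as a case-insensitive search.
printf 'FuBarBaz\nnothing here\n' | grep "$(expand_icase fubarbaz)"
```

The expanded form has no fixed literal substring left in it, which is the property that matters for the slowdown.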
Ouch, that might explain it.

I use grep -i in so many places, routinely, I'm now wondering where else may have been impacted...

In this instance, I've installed gnugrep from packages, and the filtering now takes 25 seconds. A massive difference.

grep: 41 lines/sec
ggrep: 4,000 lines/sec
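
For anyone wanting to check their own scripts, a rough comparison harness could look like the following. It is a sketch under assumptions: compare_greps is a made-up name, the sample files are tiny stand-ins for the real data, ggrep is only used if it is installed, and date +%s gives whole seconds only, so it is useful for large inputs rather than small ones.

```shell
# Tiny sample input so the sketch is self-contained; substitute real files.
tmp=$(mktemp -d)
printf 'Alpha\nbeta\nGAMMA\ndelta\n' > "$tmp/data.txt"
printf 'alpha\ngamma\n' > "$tmp/filter.txt"

# Hypothetical helper: run the same fixed-string, case-insensitive filter
# with each available grep and report match count and elapsed seconds.
compare_greps() {
    for g in grep ggrep; do
        command -v "$g" >/dev/null 2>&1 || { echo "$g: not installed"; continue; }
        start=$(date +%s)
        n=$("$g" -F -i -f "$2" "$1" | wc -l | tr -d ' ')
        end=$(date +%s)
        echo "$g: $n matching lines in $((end - start))s"
    done
}

compare_greps "$tmp/data.txt" "$tmp/filter.txt"
```

Both tools should report the same number of matching lines; only the elapsed time should differ.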
 
here is the patch to make libregex less sucky
and here is the speed diff
i cat-ed the files in /usr/share/dict 30 times into one big file
then grep-ed it with both libs (first run: patched lib via LD_LIBRARY_PATH, second run: stock lib)
Code:
[titus@utmbox ~/builds/tr]$ time grep -i bollock /tmp/huge.txt |md5
561e3653847646190f3e581c6e5c0fb4

real    0m1.028s
user    0m0.992s
sys    0m0.032s
[titus@utmbox ~/builds/tr]$ unset LD_LIBRARY_PATH
[titus@utmbox ~/builds/tr]$ time grep -i bollock /tmp/huge.txt |md5
561e3653847646190f3e581c6e5c0fb4

real    0m23.005s
user    0m22.895s
sys    0m0.114s
[titus@utmbox ~/builds/tr]$ wc -l /tmp/biga.txt
 16495170 /tmp/biga.txt

i tried netbsd's libregex testsuite and it passes all the tests
 


it has some bugs with wide chars (É é Ç) but they should be fixable

the problem with the current code and REG_ICASE is that it emits a set like [nN] for each char, so it can't build a contiguous *must_have* pattern and use Boyer-Moore on it
 
i should remove wide chars from the optimization so that the boyer-moore pattern is ascii only
this is easy, and it is the best case for people who only grep plain latin text
running boyer-moore, which searches backwards, on a utf-8 string (variable-width characters) is a pain
also i'm not even sure that the upper and lowercase forms of a wide char have the same number of bytes
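
On the byte-count question: in UTF-8 the two cases of a character are not guaranteed to be the same length. Two examples from the Unicode case mappings are U+0131 (dotless i), which uppercases to plain I, and U+212A (Kelvin sign), which lowercases to plain k. A quick sketch to confirm the encoded lengths, with utf8_len as a made-up helper that takes printf octal escapes:

```shell
# Hypothetical helper: count the bytes of a string given as printf octal escapes.
utf8_len() {
    printf "$1" | wc -c | tr -d ' '
}

# U+0131 is encoded as C4 B1; its uppercase form is ASCII I.
echo "dotless i: $(utf8_len '\304\261') bytes, uppercase I: $(utf8_len 'I') byte"

# U+212A is encoded as E2 84 AA; its lowercase form is ASCII k.
echo "Kelvin sign: $(utf8_len '\342\204\252') bytes, lowercase k: $(utf8_len 'k') byte"
```

So a case-insensitive literal cannot in general be treated as a fixed-width byte string, which supports keeping the optimization ASCII only.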
 
to build and test, apply the patch and then run:
cd /usr/src/lib/libnetbsd
make
cd /usr/src/lib/libregex
make
export LD_LIBRARY_PATH=/usr/obj/home/titus/builds/usr/src/arm64.aarch64/lib/libregex/
(replace that path with your actual build dir)
now grep and other tools will use your new lib
unset LD_LIBRARY_PATH will get you back to the original state (tools will use the stock libregex)
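
Before trusting a rebuilt library it is worth confirming that it produces the same matches as the stock one, as the md5 comparison earlier in the thread does. A minimal sketch, assuming a hypothetical helper name compare_outputs and a placeholder library directory; cksum is used here in place of md5 only because it is available on more systems.

```shell
# Hypothetical helper: compare grep -i output with and without a library override.
# $1 = directory holding the rebuilt libregex, $2 = pattern, $3 = input file
compare_outputs() {
    patched=$(LD_LIBRARY_PATH="$1" grep -i "$2" "$3" | cksum)
    stock=$(grep -i "$2" "$3" | cksum)
    if [ "$patched" = "$stock" ]; then
        echo "same output"
    else
        echo "outputs differ"
    fi
}

# Placeholder values: point libdir at your real build dir and use a real file.
libdir=${LIBREGEX_BUILD_DIR:-/nonexistent}
sample=$(mktemp)
printf 'Bollock\nsomething else\n' > "$sample"
compare_outputs "$libdir" bollock "$sample"
```

With the placeholder directory the override has no effect, so the two checksums match trivially; the check only becomes meaningful once libdir points at the patched build.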
 