grep issue?

ika256 · Jun 7, 2013

Hello,

I get different results from grep on Linux and FreeBSD. Can somebody explain why?

a="a 123456kkk"

FreeBSD:

Code:

# echo $a
a 123456kkk
# echo $a | grep -o "[0-9]*"

#

Linux:

Code:

# echo $a
a 123456kkk
# echo $a | grep -o "[0-9]*"
123456
#

Thanks.

kpa · Jun 7, 2013

I believe there's a difference in the "greediness" of the * operator somewhere. If the star operator is not greedy, then the FreeBSD behaviour is fine, because the regular expression [0-9]* also matches the empty string. If the star operator is assumed to be greedy, then the Linux behaviour is correct.

jalla · Jun 7, 2013

Off-hand I would think the zero-or-more expression should match the numbers (i.e. no match is a bug).

Perhaps you can get by with this

Code:

stingray:~% set a="a 123456kkk"
stingray:~% echo $a | grep -E -o '[0-9]+'
123456
stingray:~%

ta0kira · Jun 8, 2013

I think it's because the first match is the beginning of the line, which is "". If you get rid of "a " then you get your match. You really can't have high expectations with an expression that matches an empty string. Also, -o will normally cause all matches on a line to be extracted, but it has to do something different with your expression because there is an empty match between every character. For comparison, look at echo $a | sed -E 's/[0-9]*/!/g', which prints "!a! !k!k!k!".
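
To make the empty matches visible, here is a minimal sketch that wraps each match in angle brackets instead of replacing it with "!". It relies on sed's & (the matched text) in the replacement; the variable names are only for illustration, and -E is dropped because * means the same thing in basic syntax.

```shell
a="a 123456kkk"

# Wrap every match of [0-9]* in <>, so an empty match shows up as "<>".
marked=$(printf '%s\n' "$a" | sed 's/[0-9]*/<&>/g')
printf '%s\n' "$marked"
# <>a<> <123456>k<>k<>k<>

# Requiring at least one digit leaves only the non-empty match.
digits=$(printf '%s\n' "$a" | sed 's/[0-9][0-9]*/<&>/g')
printf '%s\n' "$digits"
# a <123456>kkk
```

The first result lines up with the "!a! !k!k!k!" output quoted above: one empty match before "a", one before the space, the digits themselves, and one after each "k".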

Kevin Barry

j_m · Jun 12, 2013

Code:

$ man grep
     -E, --extended-regexp
             Interpret pattern as an extended regular expression (i.e. force
             grep to behave as egrep).

$ grep --version
grep (BSD grep) 2.5.1-FreeBSD
$ egrep --version
egrep (BSD grep) 2.5.1-FreeBSD
$ echo $a | grep -E -o '[0-9]+'
123456
$ echo $a | egrep -o '[0-9]+'
123456
$ echo $SHELL
/usr/local/bin/mksh
$ echo $0
-mksh

ika256 · Jun 12, 2013

I checked just now: NetBSD works like Linux.

Code:

# uname -a
NetBSD .delta-comm.ge 6.1 NetBSD 6.1 (GENERIC) i386
#
# a="a 123456kkk"
# echo $a | grep -o "[0-9]*"
123456
#

So maybe the regexp implementation of FreeBSD has a bug?

j_m · Jun 12, 2013

ika256 said:
So maybe the regexp implementation of FreeBSD has a bug?

I think it's not a bug, it's a feature. Like kpa said, FreeBSD grep does not operate in greedy mode.

ika256 · Jun 12, 2013

OK, I don't understand why FreeBSD changed the default greedy/lazy behavior, but thank you and @kpa for the clarification.

kpa · Jun 12, 2013

I'm pretty sure that non-greediness of the closure (the star) operator has always been the default regardless of the options like -o used. The reason is that the greedy behaviour is an extension to the basic regular expressions. I have a little difficulty understanding why the greedy behaviour should suddenly be the default with the -o option. Non-greedy behaviour causes some surprises but you just have to know them.

ShelLuser · Jun 12, 2013

I'm wondering which Linux distribution you tested this on, and perhaps also which shell you used. Because I can't reproduce your results:

Code:

[peter@caspar ~]$ uname -a
Linux caspar.xx.xx 2.6.18-274.17.1.el5 #1 SMP Tue Jan 10 17:25:58 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
[peter@caspar ~]$ a="a 123456kkk"
[peter@caspar ~]$ echo $a | grep -o "[0-9]*"
[peter@caspar ~]$

(the domain has been changed because I don't feel like drawing public attention to this server)

So this behaviour isn't limited to FreeBSD, some Linux distributions (CentOS in this case) act exactly the same.

NewGuy · Jun 13, 2013

Testing grep

ShelLuser said:
I'm wondering which Linux distribution you tested this on, and perhaps also which shell you used. Because I can't reproduce your results:
...
So this behaviour isn't limited to FreeBSD, some Linux distributions (CentOS in this case) act exactly the same.

I tried this experiment on a Linux box (Ubuntu 12.04) and get the expected result

Code:

$ a="a 123456kkk"
$ echo $a | grep -o "[0-9]*"
123456

In my opinion not getting the "123456" output should be considered a bug.

kpa · Jun 13, 2013

The re_format(7) manual page explains it quite well:

Code:

In the event that an RE could match more than one substring of a given
string, the RE matches the one starting earliest in the string.  If the
RE could match more than one substring starting at that point, it matches
the longest.

This is the classic leftmost-longest rule for disambiguating matches that overlap in the matched string.

Applied to the problem here the string "123456" is not the leftmost match for the regular expression "[0-9]*" but the empty string at the beginning of the string "a 123456kkk" is. Only the "leftmost" part of the rule needs to be applied here, the lengths of the different matches can be ignored.
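
As a rough illustration of the rule, the following sketch marks only the first match (no g flag) with sed. It assumes sed follows the same leftmost-longest rule as grep on the system at hand; the variable names are hypothetical.

```shell
# "Leftmost" wins first: the empty string at position 0 beats "123456".
leftmost=$(printf '%s\n' "a 123456kkk" | sed 's/[0-9]*/<&>/')
printf '%s\n' "$leftmost"
# <>a 123456kkk

# "Longest" only decides among matches that start at the same position.
longest=$(printf '%s\n' "123456kkk" | sed 's/[0-9]*/<&>/')
printf '%s\n' "$longest"
# <123456>kkk

# With at least one digit required, the leftmost match starts at the "1".
nonempty=$(printf '%s\n' "a 123456kkk" | sed 's/[0-9][0-9]*/<&>/')
printf '%s\n' "$nonempty"
# a <123456>kkk
```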

So clearly the FreeBSD behaviour is correct in relation to the documentation and at least to me makes much more sense. I would really like to hear why the other behaviour would be correct? General hand-waving is no longer accepted; please provide something concrete.

(Forget my mumblings about greedy/non-greedy operators in the previous posts, the problem can be solved without them).

ShelLuser · Jun 13, 2013

kpa said:
So clearly the FreeBSD behaviour is correct in relation to the documentation and at least to me makes much more sense. I would really like to hear why the other behaviour would be correct?

I think there's more to this than merely regular expressions, because if you apply a different pattern the same issue occurs:

Code:

smtp2:/home $ a="a 123456kkk"
smtp2:/home $ echo $a | grep -o "[0-9]+"
smtp2:/home $

In this example the space could never have been the first match yet the results are completely the same. Which makes me conclude that in both examples there is no match at all.

What I do consider to be a little odd is this:

Code:

smtp2:/home $ a="a 123456kkk"
smtp2:/home $ echo $a | grep -oE "[0-9]+"
123456
smtp2:/home $

The reason is that this behaviour goes right against the manual page (grep(1)):

Code:

grep understands two different versions of regular  expression  syntax:
"basic" and "extended."  In GNU grep, there is no difference in avail-
able functionality using  either  syntax.

Unless of course this difference in behaviour isn't the doing of grep at all.

ShelLuser · Jun 13, 2013

A new message to avoid confusion. I made a small mistake up there, most likely because I usually resort to egrep:

Code:

In basic regular expressions the metacharacters ?, +, {, |,  (,  and  )
lose  their  special  meaning; instead use the backslashed versions \?,
\+, \{, \|, \(, and \).

And what do you know:

Code:

smtp2:/home $ echo $a | grep -o "[0-9]\+"
123456

So instead of showing an oddity it actually confirms the explanation which @kpa posted up there.
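
For reference, a small side-by-side sketch of the spellings discussed here. The third form is my own addition: it avoids + altogether by using only *, which the thread already shows working in basic syntax. Variable names are illustrative, and -o is assumed to be available as it is in the greps shown above.

```shell
a="a 123456kkk"

# Basic syntax: an unescaped + is a literal plus sign, so nothing matches.
# "|| true" keeps a no-match exit status from aborting a script run with set -e.
bre_plain=$(printf '%s\n' "$a" | grep -o '[0-9]+' || true)

# Extended syntax (-E): + means "one or more".
ere=$(printf '%s\n' "$a" | grep -oE '[0-9]+')

# Basic syntax without +: one digit followed by zero or more digits.
bre_star=$(printf '%s\n' "$a" | grep -o '[0-9][0-9]*')

printf 'BRE +   : [%s]\n' "$bre_plain"
printf 'ERE +   : [%s]\n' "$ere"
printf 'BRE xx* : [%s]\n' "$bre_star"
```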

kpa · Jun 13, 2013

There is no difference in functionality between the basic and extended regular expressions in GNU grep(1); it's just a syntactic difference. In computer science terms, both syntaxes allow recognition of the same "languages". I think it's the same in the BSD grep that comes as an alternative grep if you use WITH_BSD_GREP in src.conf(5).

It's still a puzzle why "[0-9]*" (taken as extended regexp) would match "123456" in "a 123456kkk", there has to be something else at play here.

ta0kira · Jun 13, 2013

kpa said:
I'm pretty sure that non-greediness of the closure (the star) operator has always been the default regardless of the options like -o used. The reason is that the greedy behaviour is an extension to the basic regular expressions. I have a little difficulty understanding why the greedy behaviour should suddenly be the default with the -o option. Non-greedy behaviour causes some surprises but you just have to know them.

I don't think it has to do with greediness. grep on FreeBSD apparently doesn't discard a trivial match ("" at the beginning, which can't be made longer) in hope of finding a "more expected" match ("123456"), whereas other versions of grep do. When not using -o, FreeBSD grep will output every line because every line matches, but what would you think if you got this as output with -o?

Code:

$ a="a 123456kkk"
$ echo $a | grep -o "[0-9]*"


123456



$

My sentiments would probably be, "Wow. Thanks a lot, captain pedantic." As you can see, both behaviors are slightly "incorrect". FreeBSD grep merely gives the first match when there are trivial matches, whereas the other greps give only the non-trivial matches. As far as I know, the "language parsing" perspective of regular expressions only goes as far as matching a single string against a single expression, not the application of a single expression to every part of a string.

If you were to use an algorithm that sets the next string position to the end of the last match, you'd get an infinite number of matches; however, that's the algorithm you'd need to use for other expressions to work. e.g.

Code:

$ echo ' 123' | grep -o '[0-9]'
1
2
3
$ echo ' 123' | grep -o '[0-9]*'

In the first case, "1" is the first match, and grep then examines "23". In the second case, however, "" is the first match, and the same algorithm would require grep to examine " 123" again (i.e. the same string), and so on in an infinite loop, unless something different is done in recognition of the trivial match.
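
One way to picture "something different" is the sketch below. It is only an assumption about how a -o style scan could avoid the loop, not a description of any real grep's code: after an empty match it steps forward one character instead of retrying at the same position, and it prints non-empty matches only. The scan function and its arguments are hypothetical, and anchors such as ^ are not handled correctly because each pass matches against the remaining text.

```shell
# scan ERE STRING: print each non-empty match on its own line.
scan() {
    awk -v re="$1" -v s="$2" 'BEGIN {
        while (length(s) > 0) {
            match(s, re)
            if (RLENGTH < 0) break            # no match in the remaining text
            if (RLENGTH == 0) {               # empty match: step past it
                s = substr(s, RSTART + 1)
                continue
            }
            print substr(s, RSTART, RLENGTH)  # non-empty match
            s = substr(s, RSTART + RLENGTH)   # resume after the match
        }
    }'
}

scan '[0-9]'  ' 123'          # three single-digit matches
scan '[0-9]*' ' 123'          # the empty match before the space is skipped
scan '[0-9]*' 'a 123456kkk'   # the case from the original question
```

Under this rule the pattern [0-9]* behaves like the Linux output in the first post: the empty matches are passed over and only "123456" is printed.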

Regarding the rest of the conversation, "[0-9]+" is a completely different case here because it requires that the match contain at least one digit.

Kevin Barry

kpa · Jun 13, 2013

Ignore what I wrote about greediness, it's incorrect.

The grep(1) documentation defines no such thing as a "trivial match". An empty-string result from grep -o with a regular expression that can match the empty string is not a special case and should always be treated as a valid result.

The interpretation of the -o option according to the manual page is:

Code:

-o, --only-matching
             Prints only the matching part of the lines.

I read that exactly as: "Print the first matching part, not every part of the string that matches".

ta0kira · Jun 13, 2013

kpa said:
I read that exactly as: "Print the first matching part, not every part of the string that matches".

Obviously that's not always true, since echo 123 | grep -o '[0-9]' prints three matches. I only used the word "trivial" out of habit from using it in mathematical contexts, although I do realize it isn't a part of the grep vernacular. Also, remember that a manpage isn't a standard; it's a description, and descriptions can be inaccurate. I provided my logic from a computational perspective based on the observed behavior of the program.

Kevin Barry

ika256 · Jun 14, 2013

ShelLuser said:
I'm wondering which Linux distribution you tested this on, and perhaps also which shell you used. Because I can't reproduce your results:

Here is...

Debian 7.0

Code:

Linux debian 3.2.0-4-amd64 #1 SMP Debian 3.2.41-2 x86_64 GNU/Linux
irakli@debian:~$ grep --version
grep (GNU grep) 2.12
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Mike Haertel and others, see <http://git.sv.gnu.org/cgit/grep.git/tree/AUTHORS>.
irakli@debian:~$ echo $SHELL
/bin/bash
irakli@debian:~$ a="a 123456kkk"
irakli@debian:~$ echo $a | grep -o "[0-9]*"
123456
irakli@debian:~$

CentOS

Code:

[root@localhost ~]# uname -a
Linux localhost.localdomain 2.6.32-279.22.1.el6.x86_64 #1 SMP Wed Feb 6 03:10:46 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
[root@localhost ~]# echo $SHELL
/bin/bash
[root@localhost ~]# grep --version
GNU grep 2.6.3

Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

[root@localhost ~]#
[root@localhost ~]# a="a 123456kkk"
[root@localhost ~]# echo $a | grep -o "[0-9]*"
123456
[root@localhost ~]#