Solved sort -u appears not to be working

If I understand the man page correctly, sort -u run against a list with this form:

Code:
foo
foo
bar
mumble
mumble
mumble
fratz

should yield

Code:
foo
bar
fratz
mumble
.

Yes?

I just want to check my understanding before writing up a bug report, because it does do the sort for me, but does not remove the duplicates. A quick search of the bugs database didn't turn up anything, which makes me suspect the bug is in me.
 
The output is sorted alphabetically, so:
Code:
$ printf "foo\nfoo\nbar\nmumble\nmumble\nmumble\nfratz\n" | sort -u
bar
foo
fratz
mumble
What output do you get?

Code:
$ sort --version
2.3-FreeBSD
 
That's weird. I got what you got. But in this case:
head -n 10 line_tags | sort -k 2 -u
where head grabs the first 10 lines of a humongous file:
Code:
<tag k="name" v="Jurgella Lane"/>
<tag k="highway" v="residential"/>
<tag k="created_by" v="JOSM"/>
<tag k="hgv" v="designated"/>
<tag k="ref" v="US 51 Business"/>
<tag k="name" v="Division Street"/>
<tag k="oneway" v="yes"/>
<tag k="highway" v="primary"/>
<tag k="maxspeed" v="35 mph"/>
<tag k="name" v="North Point Drive"/>
...I get...
Code:
<tag k="created_by" v="JOSM"/>
<tag k="hgv" v="designated"/>
<tag k="highway" v="primary"/>
<tag k="highway" v="residential"/>
<tag k="maxspeed" v="35 mph"/>
<tag k="name" v="Division Street"/>
<tag k="name" v="Jurgella Lane"/>
<tag k="name" v="North Point Drive"/>
<tag k="oneway" v="yes"/>
<tag k="ref" v="US 51 Business"/>
...sorting on the field, but not removing the dupes.
 
I'm pretty sure I've just given myself a headache trying to understand the sort documentation, however this bit stands out -

Fields are specified by the -k field1[,field2] command-line option. If
field2 is missing, the end of the key defaults to the end of the line.

Specifying the second field (which should be k="X") as the start and end field for the match seems to work -

Just specifying -k 2, the key seems to include everything from k= to the end of the line:
(The output here is a bit confusing because it seems to use <> around the key in the debug output, which is also present at the end of your strings.)
Code:
# cat test.dat | sort -k 2 -u --debug
<tag k="oneway" v="yes"/>
; k1=< k="oneway" v="yes"/>>
[...]

With -k 2,2:
Code:
# cat test.dat | sort -k 2,2 -u --debug
<tag k="oneway" v="yes"/>
; k1=< k="oneway">
...

Code:
# cat test.dat | sort -k 2,2 -u

<tag k="created_by" v="JOSM"/>
<tag k="hgv" v="designated"/>
<tag k="highway" v="residential"/>
<tag k="maxspeed" v="35 mph"/>
<tag k="name" v="Jurgella Lane"/>
<tag k="oneway" v="yes"/>
<tag k="ref" v="US 51 Business"/>
#
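For anyone who wants to reproduce the difference without the big file, here is a minimal sketch using the ten sample lines posted earlier in the thread. The variable names are mine, and forcing LC_ALL=C (plain byte-order comparison) is my assumption to keep the result predictable across systems.

```shell
# The ten sample lines quoted earlier in the thread.
data='<tag k="name" v="Jurgella Lane"/>
<tag k="highway" v="residential"/>
<tag k="created_by" v="JOSM"/>
<tag k="hgv" v="designated"/>
<tag k="ref" v="US 51 Business"/>
<tag k="name" v="Division Street"/>
<tag k="oneway" v="yes"/>
<tag k="highway" v="primary"/>
<tag k="maxspeed" v="35 mph"/>
<tag k="name" v="North Point Drive"/>'

# -k 2: the key runs from field 2 to the end of the line, so no two keys are equal.
to_eol=$(printf '%s\n' "$data" | LC_ALL=C sort -k 2 -u | wc -l | tr -d ' ')

# -k 2,2: the key is field 2 only, so lines sharing a k="..." collapse to one.
field2=$(printf '%s\n' "$data" | LC_ALL=C sort -k 2,2 -u | wc -l | tr -d ' ')

echo "-k 2   -u keeps $to_eol lines"
echo "-k 2,2 -u keeps $field2 lines"
```

This should report ten lines for -k 2 and seven for -k 2,2, the same counts as the outputs posted above.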
 
My brain hurts! What could possibly produce this effect?

Code:
14:46:29 Mon, 13 Mar [fastcat:root]/V/raw> head -n 10 line_tags | sort -k 2.5 -u
<tag k="created_by" v="JOSM"/>
<tag k="hgv" v="designated"/>
<tag k="highway" v="primary"/>
<tag k="highway" v="residential"/>
<tag k="maxspeed" v="35 mph"/>
<tag k="name" v="Division Street"/>
<tag k="name" v="Jurgella Lane"/>
<tag k="name" v="North Point Drive"/>
<tag k="oneway" v="yes"/>
<tag k="ref" v="US 51 Business"/>

14:46:33 Mon, 13 Mar [fastcat:root]/V/raw> head -n 10 line_tags | sort -k 2.6 -u
<tag k="name" v="Division Street"/>
<tag k="name" v="Jurgella Lane"/>
<tag k="name" v="North Point Drive"/>
<tag k="maxspeed" v="35 mph"/>
<tag k="ref" v="US 51 Business"/>
<tag k="hgv" v="designated"/>
<tag k="highway" v="primary"/>
<tag k="highway" v="residential"/>
<tag k="oneway" v="yes"/>
<tag k="created_by" v="JOSM"/>
 
Your 2,2 definitely seems to work for the example, but I don't know why, especially since 2,6 does not work. Nor does 2,1, 2,3, or any other combination, apparently.

I read the docs as saying that specifying a length requires dot notation, e.g. 2.2 except in countries where comma and dot are flipped in meaning. Moreover, the default field separator is 1 or more spaces, so that should have truncated field 2 appropriately but obviously didn't!

My brain definitely hurts!
 
I have no idea why you are using 2.6 as the field to sort by?

The argument to -k should be the first field to use in the sort key, followed by the last field (separated by a comma). If no last field is specified it goes to the end of the line.

Take one line of your file
Code:
<tag k="maxspeed" v="35 mph"/>
The default field separator is blank space, so your string is composed of 3 fields:

Code:
<tag
 k="maxspeed"
 v="35 mph"/>
Note that both fields 2 & 3 start with a space.

If you specify -k 2 it will use field 2 to the end of the line, so it will sort using:
Code:
 k="maxspeed" v="35 mph"/>
(Again with a leading space.)

If you specify -k 2,2, it will start and end with field 2, which I'm pretty sure is what you want:
Code:
 k="maxspeed"

I have no idea what specifying a float to the -k option does; it doesn't seem to be documented, but using debug mode it seems to cut off the first character of the second field...
Code:
k1= neway" v="yes"/>
 
Oh wait, x.y is documented, but it still doesn't make much sense, as 2.6 should start the sort key at the 6th character, i.e. removing the first 5 characters, not just the first...

Edit: Ignore all that, it's missing the first character of the string in your XML (ighway instead of highway), which is actually correct once you count the five characters of  k="h (the field starts with a space).
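A rough way to check where the character offset lands without --debug: the three lines below are made up in the same shape as the real data, and the variable names and LC_ALL=C are my own assumptions. The b modifier on a key position is the standard way to skip leading blanks before counting characters.

```shell
# Three made-up lines in the same shape as the real data.
sample='<tag k="oneway" v="yes"/>
<tag k="name" v="x"/>
<tag k="hgv" v="y"/>'

# Field 2 includes its leading blank, so character 5 is the first letter of
# the value (o, n, h) and character 6 is the second letter (n, a, g).
from_5=$(printf '%s\n' "$sample" | LC_ALL=C sort -k 2.5)
from_6=$(printf '%s\n' "$sample" | LC_ALL=C sort -k 2.6)

# With b, the leading blank is skipped before counting, so offset 4 lands on
# the same letter that offset 5 did.
from_4b=$(printf '%s\n' "$sample" | LC_ALL=C sort -k 2.4b)

printf '%s\n' "-k 2.5:" "$from_5" "-k 2.6:" "$from_6" "-k 2.4b:" "$from_4b"
```

With 2.5 the hgv line should come first (sorting on hgv, name, oneway); with 2.6 the name line should come first (sorting on ame, gv, neway); and 2.4b should give the same order as 2.5.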
 
I was just sticking in numbers to see whether I could detect some pattern--I didn't expect anything to necessarily make sense. To me, "field2", if used, should be a tiebreaker when two records have identical field1s (which may be because I've done too much ORDER BY for my own good). But your interpretation makes sense: 2,2 means sort on field 2 to field 2 inclusive because the normal default delimiter is ignored when only one field is specified! Which makes no sense but that's okaaaay! :confused:

I think I'm going to be crashing my system any time now: I just fed that 8GB file to sort -k 2,2 -u
 
Yeah, the -k option is just specifying one key to sort by, which could be multiple fields. If you actually want to sort by one field then another, you have to specify two key specifications:
Code:
sort -k 2,2 -k 3,3
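A small sketch of what the two-key form does, using made-up three-column data (the rows and variable names are hypothetical, and LC_ALL=C is my assumption to get plain byte-order comparison):

```shell
# Made-up data: sort on column 2, then on column 3 within each column-2 group.
rows='x b 2
y a 2
z b 1
w a 1'

two_keys=$(printf '%s\n' "$rows" | LC_ALL=C sort -k 2,2 -k 3,3)
printf '%s\n' "$two_keys"
```

The "a" rows should come out first, ordered 1 then 2, followed by the "b" rows in the same order: w a 1, y a 2, z b 1, x b 2.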

In the case of your sample data, the third field seems to always be in order even if I don't specify it. By the look of the debug output, sort automatically uses the entire string as a tie-breaker if the specified key(s) are identical, and I can't see a way to disable that.
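For what it's worth, the -s (stable) option may be what turns that tie-breaker off. It is not part of POSIX; GNU sort documents it as disabling the last-resort comparison, and I believe FreeBSD's sort accepts it as well, but check your own man page. A sketch with made-up rows, assuming -s is available:

```shell
# Made-up data where two lines tie on field 2 but differ elsewhere.
rows='z b 1
y a 9
x b 2'

# Default: the tie on field 2 is broken by comparing the whole lines.
with_tiebreak=$(printf '%s\n' "$rows" | LC_ALL=C sort -k 2,2)

# -s: tied lines are left in their input order instead.
stable=$(printf '%s\n' "$rows" | LC_ALL=C sort -s -k 2,2)

printf '%s\n' "default:" "$with_tiebreak" "stable:" "$stable"
```

Without -s the two "b" lines should come out as x then z (whole-line order); with -s they should stay as z then x, the order they were fed in.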

Edit: also this...
"because the normal default delimiter is ignored when only one field is specified!"
The biggest roadblock for me was realising that specifying field 2 didn't actually just use field 2, it went to the end of the line unless I specified an end field. Seems backwards. The --debug option came in very useful.
 
What messed with my head was the misuse of "default" which, where I grew up, meant "what you get unless you specify something else". Whereas here it means "what you don't get unless you specify 2 fields even if they're the same". For some reason that just clogged up my brain.
 