I've worked out two additional solutions of my own, reflecting different ways to solve your problem:
- use sed(1) and make 'smart' use of utilities
- let sed(1) do as much as possible, minimizing its invocations
Perhaps this helps to solve future problems using utilities, and sed in particular.
Many roads lead to Rome.
I only don’t understand what you have against cut? I can’t imagine writing any of my monstrosities of one-liners without it. For example, [...]
Well, you asked ...
Thank you for providing an example, solution and example run; I found the explanation helpful too. Unfortunately, a problem, or the code that tries to solve it, often comes without a useful description.
As to my previous answer, I had a longer answer prepared but I trimmed it. There is of course nothing against cut(1) or tr(1). In my personal, limited experience, when I want to go from a to b, I sometimes need more than cut(1) can provide, or I can easily integrate the desired functionality into the adjacent commands on the other side of the fence, that is: |. That might be considered a personal preference. All utilities have their particular use cases. However, when one chooses to use more utilities (as opposed to some extra coding), it usually pays to read the man pages carefully to avoid unnecessary coding. nl(1) provides useful options in this case.
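As a small illustration of those nl(1) options, here is a minimal sketch. The function name `number_names` and the two sample names are only for this example, and `-w 3` is my own addition to keep the numbers narrow; the first solution further down relies on nl's default width instead.

```shell
# Number the lines on stdin, right-justified without leading zeros.
#   -n rn : right-justified numbers, no zero padding
#   -w 3  : number field width (assumption: fewer than 1000 lines)
#   -s    : separator printed between the number and the text
number_names() {
    nl -n rn -w 3 -s '. '
}

printf '%s\n' 'Mark Abene' 'Peter Allan' | number_names
```

Because nl(1) pads the number field itself, no extra sed substitution is needed to right-justify single- and double-digit numbers.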
Regarding the problem at hand: I imagine the original source is better formatted (e.g. XML or HTML). In that case, a parser would be a good tool. Often, though, we have to work with the cards we're dealt. If further, more flexible data manipulation is needed, the obvious approach would be to structure all the data and store it appropriately, perhaps in a database. As we're dealing with plain text (not even <tab>-delimited), sed & awk (with additional assistance) seem like good candidates. Having worked on my sed solutions, I can't see an awk solution that does things differently or more easily than sed, so I only used sed.
Two things that sed cannot do are sorting and formatting columns, so external help is needed here. In addition to sed(1), only one call each of sort(1) and column(1) remains as part of my second solution. You might have already noticed by careful reading: I've cut out cut(1), again! However, I agree that using cut is appropriate in this context, even though it increases the number of commands and pipes.
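The division of labour between sed(1) and sort(1) can be tried on a few lines in isolation. This is only a sketch: the function name `sort_by_surname` is made up, the sample names are taken from the output further down, and `LC_ALL=C` is my addition to make the sort order predictable across locales.

```shell
# Sort "First [Middle] Surname" lines by surname.
# 1. sed strips trailing spaces and moves the last word to the front,
# 2. sort orders (and de-duplicates) the lines,
# 3. sed moves the first word back to the end.
sort_by_surname() {
    sed -E 's/ *$//; s/(.*) ([^ ]+)$/\2 \1/' |
        LC_ALL=C sort -u -k1 |
        sed -E 's/^([^ ]+) (.*)/\2 \1/'
}

printf '%s\n' 'Johnny Billquist' 'Mark J. Blair' 'Mark Abene' | sort_by_surname
```

The sketch assumes every name has at least two words; a single-word line would pass through both substitutions unchanged.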
While I don't comment on my first solution, I do provide comments in the script files of my second solution. There, I also handled the right justification of numbers below 100, in much the same way you did; however, one doesn't have to do that when using nl(1), as the first solution demonstrates.
I've decided to put both solutions in a spoiler. If one wants to work out an alternative solution without my solutions in plain view, one can choose to do so; viewing them is just a couple of clicks away though. While using sed, I ran into what looks like a sed bug. That took considerable time to analyse; I've worked around it. I used the file i, created by:
elinks -dump "https://web.archive.org/web/20240316090019/http://mim.stupi.net/nodedb" > i
Main processing steps - second solution
Rich (BB code):
$ cat i | \
sed -Ef preSort.sed | \
sort -uk1 | \
sed -Ef postSortLN.sed | \
sed -Ef mergeRjustLN.sed | \
column -x
sed solution using cut, sort, nl & column
Rich (BB code):
$ cat i | \
sed -n '37,1143 p' | \
cut -c 20-45 | \
sed -E 's/ *$//;s/(.* )([^ ]+)$/\2\1/' | \
sort -uk1 | \
sed -E 's/^([^ ]+)( .*)/\2\1/' | \
nl -nrn -s'.' | \
column -x
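For comparison, the column extraction can be done with either cut(1) or sed(1). The sketch below uses a made-up fixed-width row (20 columns of other data, a 22-column name field, then more data); the real table layout may differ, and the helper names `sample_row`, `name_by_cut` and `name_by_sed` are hypothetical. I use columns 21-42 for cut here so that it selects exactly the same characters as the sed expression.

```shell
# Hypothetical fixed-width row: 20 columns, a 22-column name field, the rest.
sample_row() {
    printf '%-20s%-22s%s\n' 'column-a' 'Johnny Billquist' 'column-c'
}

# cut selects character positions 21..42, sed then strips trailing spaces.
name_by_cut() {
    cut -c 21-42 | sed 's/ *$//'
}

# sed alone: skip 20 characters, keep the next 22, drop the rest,
# then strip trailing spaces.
name_by_sed() {
    sed -E 's/.{20}(.{22}).*/\1/; s/ *$//'
}

sample_row | name_by_cut
sample_row | name_by_sed
```

One difference to keep in mind: for lines shorter than 42 characters the sed expression does not match and leaves the line untouched, whereas cut still truncates it.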
sed solution using sort & column - sed script files
Rich (BB code):
[1-0] % cat preSort.sed
# select table rows by deleting other lines
# cut out name column at the start of first name
# remove line ending spaces
# move surname to the front as preparation for sort
1,36 d
1144,$ d
s/.{20}(.{22}).*/\1/
s/ *$//
s/(.*) ([^ ]+)$/\2 \1/
[2-0] % cat postSortLN.sed
# restore original name order
# generate line numbers, to be processed later
s/^([^ ]+) (.*)/\2 \1/
=
[3-0] % cat mergeRjustLN.sed
# merge line numbers
# add line number formatting
N
s/\n/. /
# right justify single and double digit line numbers
s/^[^.]\./ &/
s/^[^.]{2}\./ &/
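The numbering done by the last two scripts can also be tried on its own. This is a sketch that folds them into one pipeline with the commands given inline instead of in script files; the function name `number_and_pad` is made up, and it assumes fewer than 1000 input lines.

```shell
# First sed: '=' prints the line number on a line of its own, before each line.
# Second sed: N joins the number line with the text line, the newline becomes
# '. ', and the two padding substitutions right-justify the number to width 3.
number_and_pad() {
    sed '=' |
        sed -E 'N; s/\n/. /; s/^[^.]\./ &/; s/^[^.]{2}\./ &/'
}

printf '%s\n' 'Mark Abene' 'Peter Allan' | number_and_pad
```

A single-digit number is matched by both padding substitutions, because after the first one it is two characters wide, so it ends up with two leading spaces; a double-digit number only matches the second and gets one.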
Rich (BB code):
$ cat i | sed -Ef preSort.sed | sort -uk1 | sed -Ef postSortLN.sed | sed -Ef mergeRjustLN.sed | column -x
1. Mark Abene 2. Peter Allan 3. Anders Andersson 4. Brian Angus 5. Paul Anokhin
6. Robert Armstrong 7. Madeline Autumn-Rose 8. Moishe Bar 9. Mark Benson 10. Jean-Yves Bernier
11. Mark Berryman 12. Johnny Billquist 13. Mark J. Blair 14. Robin Blair 15. Tony Blews
<snap>
101. Peter Whisker 102. Mark Wickens 103. Dan Williams 104. John Wilson 105. Julian Wolfe
106. Frank Wortner 107. Alice Wyan 108. John Yaldwyn 109. Connor Youngquist
___
Edit: there's final editing for you: spaces fly where they're not supposed to. Changed preSort.sed as intended.
30 Nov: Spoilers removed