Get specified section/table/content from file using sed/awk/perl, etc.

kenorb · Dec 6, 2010

Each file has the same structure as follows:

HTML:

<html>
<head></head>
<body><p><table>My table here!</table></p>
</body>
</html>

I'm looking to dump only the section between <table> and </table> (including those tags).
I've spent a while looking for a solution, but it still isn't clear to me what the easiest way to achieve that is.

I tried the following solutions:
http://austinmatzko.com/2008/04/26/sed-multi-line-search-and-replace/
http://www.unix.com/shell-programming-scripting/147347-how-get-one-particular-section-using-awk.html
http://www.unix.com/shell-programming-scripting/66251-remove-html-tags-bash.html
http://www.unix.com/shell-programming-scripting/58479-multiple-line-match-using-sed.html

A good start:

Code:

lynx --base --source http://ai-contest.com/rankings.php | less "+/table"

Code:

sed -n '1h;1!H;${;g;s/<h2.*/No title here/g;p;}' sample.php

Code:

perl -0777 -pe 's/\A[^\{]*\{//s; s/\}.*?\{/\n/sg; s/\}[^\}]*\Z//s'

http://www.grymoire.com/Unix/Sed.html#uh-47

wblock@ · Dec 6, 2010

kenorb said:
Code:

perl -0777 -pe 's/\A[^\{]*\{//s; s/\}.*?\{/\n/sg; s/\}[^\}]*\Z//s'

http://www.grymoire.com/Unix/Sed.html#uh-47

Aaah! My eyes!

Code:

perl -0777 -ne 'print $1 if /(<table>.*<\/table>)/' myfile.html

But properly parsing HTML is done with Perl modules, not raw regexes.

qsecofr · Dec 6, 2010

A Perl solution might use /usr/ports/www/p5-HTML-TableExtract. Or search the ports tree for the keyword "p5-HTML-Table".

kenorb · Dec 14, 2010

wblock:
Thank you for the great example. It looks very simple, and I like simple solutions, but something is still missing.
I tried this command and got an empty result.
I also tried:

Code:

perl -0777 -ne 'print $1' *

Empty output.

Code:

> echo test | perl -0777 -ne 'print \$1'
SCALAR(0x80123fde0)>

What am I missing?

wblock@ · Dec 14, 2010

kenorb said:
wblock:
Thank you for the great example. It looks very simple, and I like simple solutions, but something is still missing.
I tried this command and got an empty result.
I also tried:

Code:

perl -0777 -ne 'print $1' *

Empty output.

Code:

> echo test | perl -0777 -ne 'print \$1'
SCALAR(0x80123fde0)>

What am I missing?

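The entire regex, for a start. $1 is only filled in by a successful regex match that contains a capture group.
% man perlre | less +/Capture
The "if" is also important: it keeps print from running when the pattern did not match.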
