View Full Version : need help with sed and regexps


edhunter
October 16th, 2009, 09:33
Hello guys,
I need help with sed and regular expressions.
I have an input file containing text with HTML formatting.
I have to import this file into another program that respects only the </br> tag.
Before importing the file, I need to strip all HTML tags except <br> and its variations.
Every kind of br has to become </br>.

Something like this:
1. <br>,<br />,</br> ... => </br>
2. "<whatever tag without </br> >" => ""

How could I do it using sed?

dennylin93
October 16th, 2009, 13:43
sed 's/<br>/<\/br>/g' should turn <br> into </br>. The other variations can be handled with similar substitutions.
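
The br variations can probably be covered by one pattern instead of one substitution per spelling. A minimal sketch, assuming only spaces and slashes may surround the lowercase name "br" (so <BR> or <br clear="all"> would not match); normalize_br is a made-up helper name, not something from this thread:

```shell
# Hypothetical helper: rewrite <br>, <br/>, <br />, </br> and < br /> as </br>.
# Assumption: only spaces and slashes appear around a lowercase "br".
normalize_br() {
  sed 's:<[ /]*br[ /]*>:</br>:g'
}

printf '%s\n' 'a<br>b<br />c</br>d< br />e' | normalize_br
# prints: a</br>b</br>c</br>d</br>e
```

Using ":" as the delimiter avoids escaping the slash in </br>. Because the pattern allows only spaces and slashes around "br", a tag such as <abbr> is left alone.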

Zare
October 16th, 2009, 15:41
sed 's@<\([^<br>][^<>]*\)>\([^<>]*\)</\1>@\2@g'


Pipe the line into this and it should strip all HTML tags; the content between the tags will remain intact, and <br> tags will remain too.

P.S.
Up The Irons!
;)

edhunter
October 19th, 2009, 11:49
:) 10x \m/
but it didn't work
Here is a sample file:
line1<tag1>alabala<br>blabla</tag2>
line2<tag>blabla
<tag3>text<tag4>blabla<br>
<br></br>
</br>
< br />

Here is the sed output:
sed 's@<\([^<br>][^<>]*\)>\([^<>]*\)</\1>@\2@g' test.txt
line1<tag1>alabala<br>blabla</tag2>
line2<tag>blabla
<tag3>text<tag4>blabla<br>
<br></br>
</br>
< br />
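
A likely reason nothing changes: [^<br>] is a bracket expression, so it matches a single character that is not "<", "b", "r" or ">". It does not mean "anything except the tag br". The pattern also needs an opening tag and a closing tag with the same name and no other tag between them, and no line in the sample has such a pair. A small check on made-up input (not the sample file):

```shell
# <b> is skipped because its name starts with "b";
# <i>...</i> is a matched pair, so it is unwrapped.
result=$(printf '%s\n' '<b>bold</b> <i>it</i>' |
  sed 's@<\([^<br>][^<>]*\)>\([^<>]*\)</\1>@\2@g')
printf '%s\n' "$result"
# prints: <b>bold</b> it
```

So any tag whose name starts with "b" or "r" is left in place, not only br.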


I got what I wanted with three sed commands:
sed -e "s:<[^<>]*br[^<>]*>:uniqstring123:g" Export.TXT > out1.txt
sed -e "s:<[^<>]*>::g" out1.txt > out2.txt
sed -e "s:uniqstring123:</br>:g" out2.txt > FINAL.TXT

But my way seems very lame... that's why I need another solution :)
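
For what it's worth, the three passes can probably be folded into a single sed invocation with several -e expressions, which drops the temporary files. A sketch under the same assumption as before, namely that the placeholder uniqstring123 never occurs in the input; strip_tags_keep_br is a made-up helper name:

```shell
# Hypothetical helper: same three substitutions, applied in order to each line.
# Assumption: "uniqstring123" does not appear anywhere in the input.
strip_tags_keep_br() {
  sed -e 's:<[^<>]*br[^<>]*>:uniqstring123:g' \
      -e 's:<[^<>]*>::g' \
      -e 's:uniqstring123:</br>:g'
}

printf '%s\n' 'line1<tag1>alabala<br>blabla</tag2>' '< br />' | strip_tags_keep_br
# prints:
# line1alabala</br>blabla
# </br>
```

With files this would be: strip_tags_keep_br < Export.TXT > FINAL.TXT. Note that the first expression treats any tag containing the letters "br" (for example <abbr>) as a br, exactly as the three separate commands do.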