Solved: Pulling tables from websites

The-guy-using-BSD

New Member


Messages: 5

I'm using FreeBSD 10.1. I need to pull tables from a web page into a file, with all the data for a row on one line. I have tried both w3m and elinks, and neither works well: w3m outputs unprintable characters, so I'd like to get elinks working instead.

elinks on FreeBSD:
Code:
+----------------------------------------+
| Name | Cost | Type | Price |
+----------------------------------------+

+----------------------------------------+
| Green | $0.60 | food | $1.25 |
| Apple | | | |
+----------------------------------------+
The problem with elinks is that the second word of the name is dumped onto a separate line. Command: elinks 'http://www.example.com' > somefile.txt

I'm having trouble writing a script that post-processes the elinks output into a new file that corrects this.

Edit: What it should look like
Code:
+----------------------------------------+
| Name | Cost | Type | Price |
+----------------------------------------+

+----------------------------------------+
| Green Apple | $0.60 | food | $1.25 |
+----------------------------------------+
Edit: something I forgot

When I use elinks to just browse the page interactively, it looks exactly like what I want. It might be a line-length limit in the dumped output.
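That would fit the line-length theory: as far as I know (an assumption about your elinks build), elinks only renders a page to plain text on stdout when given -dump, and the dump is wrapped to a fixed width (80 columns by default), which would explain the wrapped names. The width can be raised with -dump-width:

```shell
# Dump the rendered page to a file; -dump-width widens the rendering so
# long table rows are less likely to wrap (the URL is a placeholder).
elinks -dump -dump-width 512 'http://www.example.com' > somefile.txt
```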

Edit: the answer

Make sure the top of the file is the first product row. I cut the header lines before it, then cat them back into the final file afterwards.

Code:
file=somefile.txt          # raw elinks dump, starting at the first product row
ff=finalfile.txt

linenum=$(wc -l < "$file")
clnum=1                    # current line number
lanum=2                    # lookahead line number
rm -f "$ff"

while [ "$clnum" -lt "$linenum" ]; do
        cl=$(awk "NR==$clnum" "$file")
        la=$(awk "NR==$lanum" "$file")
        if echo "$la" | grep -q '+'; then
                # Lookahead is a +---+ separator: row already fits on one line.
                echo "$cl" >> "$ff"
                echo "$la" >> "$ff"
        else
                # Lookahead is a wrapped continuation line: merge its first
                # field into the first field of the current row.  (Assumes
                # the name contains no sed metacharacters.)
                fh=$(echo "$cl" | awk -F '|' '{gsub(/^ +| +$/, "", $2); print $2}')
                sh=$(echo "$la" | awk -F '|' '{gsub(/^ +| +$/, "", $2); print $2}')
                echo "$cl" | sed "s/$fh/$fh $sh/" >> "$ff"
                clnum=$((clnum + 1))
                lanum=$((lanum + 1))
                # Copy the separator that follows the continuation line.
                awk "NR==$lanum" "$file" >> "$ff"
        fi
        clnum=$((clnum + 2))
        lanum=$((lanum + 2))
done
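The same merge can also be done in a single awk pass, which avoids re-reading the file for every line. This is only a sketch against the sample output shown above (the field layout is an assumption): it buffers each data row and, when the next line is not a +---+ separator, splices that line's first field into the buffered row.

```shell
#!/bin/sh
# One-pass version of the merge; input format assumed from the sample above.
merge_rows() {
	awk '
	/^\+/ {                       # separator: flush any buffered row
		if (data != "") { print data; data = "" }
		print
		next
	}
	/^ *$/ { print; next }        # pass blank lines through unchanged
	{
		if (data == "") {     # first physical line of a row: buffer it
			data = $0
			n = split($0, f, "|")
		} else {              # wrapped continuation: merge field 2
			split($0, c, "|")
			gsub(/ +$/, "", f[2])
			gsub(/^ +| +$/, "", c[2])
			f[2] = f[2] " " c[2] " "
			data = ""
			for (i = 1; i <= n; i++)
				data = data f[i] (i < n ? "|" : "")
		}
	}
	END { if (data != "") print data }
	'
}

merge_rows <<'EOF'
+----------------------------------------+
| Green | $0.60 | food | $1.25 |
| Apple | | | |
+----------------------------------------+
EOF
```

On the sample above this prints the two-line row merged into `| Green Apple | $0.60 | food | $1.25 |` between its separators.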
 

SirDice

Administrator
Staff member
Moderator

Reaction score: 7,409
Messages: 29,985

Yep, www/p5-libwww has a few modules that could do this. It would require some Perl scripting knowledge though. The LWP::Simple module is fairly easy to use.
 

wblock@

Administrator
Staff member
Moderator
Developer

Reaction score: 3,638
Messages: 13,850

Another alternative is to just grab the page with fetch(1) and parse the data out of the HTML.
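To sketch that approach (the HTML below is a made-up stand-in for the real page, and flatten_rows is a name I invented): fetch the raw HTML, then flatten each <tr> into one pipe-separated line with awk.

```shell
#!/bin/sh
# Sketch of the fetch(1)-and-parse approach: turn each HTML table row
# into one pipe-separated text line.
flatten_rows() {
	awk '
	{ buf = buf " " $0 }                  # a row may span source lines
	END {
		n = split(buf, rows, /<\/tr>/)
		for (i = 1; i <= n; i++) {
			row = rows[i]
			if (row !~ /<td/) continue
			sub(/.*<tr[^>]*>/, "", row)   # drop markup before the row
			gsub(/<[^>]*>/, "|", row)     # every tag becomes a separator
			gsub(/\|+/, " | ", row)       # collapse separator runs
			gsub(/^ +| +$/, "", row)
			print row
		}
	}'
}

# In practice: fetch -o - 'http://www.example.com/page.html' | flatten_rows
flatten_rows <<'EOF'
<table>
<tr><td>Green Apple</td><td>$0.60</td><td>food</td><td>$1.25</td></tr>
</table>
EOF
```

For the sample row this prints `| Green Apple | $0.60 | food | $1.25 |`, i.e. the whole name on one line with no wrapping to undo.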
 