Any ideas how to extract data from tables like these?

Hi!

Does anyone have any idea how to extract data from this kind of tables... I have lots of them :D

https://info.rtu.lv:443/rtupub/prg?ukNoteikId=1994

I've made script to download tables like these. Now I need to extract data.... and write output to other files.... later I would merge and sort these files and add info to database....

One of problems is that these html files are in UTF-8.


Any ideas?

Thanks in advance
 
I would take the easy way. Grab Python, use one of the parsers that are bundled with it and be done with it.

Since you know Perl, you could use that. I'm sure Perl has modules for parsing HTML/XML as well. If you want to go the manual route, you could build a state machine to parse the document. You could also use AWK for the state machine, if you wanna go the Unix way.
 
I'm not familiar with python, and failed with perl {I think because of Latvian native characters}

also I'm not familiar with python, any other options ?

If I could get flat text in similar form... I would be already be happy :D
 
Perl can't easily be made to work with UTF-8? Hm, I'm surprised.

Okay, this is not pretty, but I'm not really an AWK guy. All I can say is, it works. :)

Here is what my source.txt looks like -- as I understand it, this is the table you're interested in:

Code:
<table class="inform">

<tr>

        <th width="70px"> Kods </th>

        <th width="380px"> Priekšmets </th>

        <th width="95px"> Kredītpunkti </th>

</tr>



        

<tr>

        <td class="grp2"> A   </td>

        <td class="grp2"> Programmas obligātie priekšmeti </td>

        <td class="grp2" align="right"> 18 </td>

</tr>







<tr>

        <td><a href="prg/d?discId=9498">DIP403  </a></td>

        <td><a href="prg/d?discId=9498">Programmatūras plānošana un metroloģija</a></td>

        <td align="right">3 </td>

(It's not whole, I don't want to unnecessarily copy pages of text into the forum -- you get the idea what the contents are.)

And this is my parse.awk:
Code:
BEGIN {
    in_row = 0
    code = ""
    description = ""
    credits = ""
    RS = ">"
    FS = "<"
}

/<td/ && in_row == 0 {
    in_row = 1
    next
}

/<a href=.*/ && in_row == 1 {
    in_row = 2
    next
}

in_row == 2 {
    code = $1
    in_row = 3
}

/<a href=.*/ && in_row = 3 {
    in_row = 4
    next
}

in_row == 4 {
    description = $1
    in_row = 5
}

/<td align.*/ && in_row == 5 {
    in_row = 6
    next
}

in_row == 6 {
    credits = $1
    in_row = 0

    print "Code: " code ", description: " description ", credits: " credits
}

Lo and behold, it works!

Code:
[starlight] /tmp > awk -f parse.awk source.txt 
Code: DIP403  , description: Programmatūras plānošana un metroloģija, credits: 3 
Code: DSP404  , description: Informācijas sistēmu metodoloģijas, credits: 3 
Code: DIP496  , description: Datortīklu operētājsistēmas, credits: 3 
Code: DSP403  , description: Lielu datu bāzu administrēšana, credits: 3 
Code: DPI453  , description: PROLOG un loģiskā programmēšana, credits: 3 
Code: DPI364  , description: Sadalītu datu apstrāde datoru tīklos, credits: 3 
Code: DSP425  , description: Intelektuālu sistēmu projektēšanas metodes, credits: 3 
Code: DSP427  , description: Datorsistēmu projektēšanas rīku komplekts, credits: 3 
Code: DIP418  , description: Lietišķās intelektuālās sistēmas, credits: 3 
Code: DIP419  , description: Ievads risinājumu datorizētā apstrādē, credits: 3 
Code: DPI458  , description: Lietišķā datorgrafika, credits: 3 
Code: DPI461  , description: HTML valoda, credits: 3 
Code: DSP406  , description: Projektu kvalitātes vadība, credits: 3 
Code: DSP409  , description: Multibāzu sistēmas, credits: 3 
Code: DIP497  , description: Informācijas aizsardzība, credits: 3 
Code: DIP498  , description: Lietotāja adaptīvā interfeisa programmatūra, credits: 3 
Code: DPI457  , description: Vizuālā programmēšana, credits: 3 
Code: DPI459  , description: Moderno programmēšanas valodu praktikums, credits: 3 
Code: DSP410  , description: CASE rīki datu bāzu projektēšanā, credits: 4 
Code: DIP404  , description: Programmatūras izveides tehnoloģiskie līdzekļi, credits: 4 
Code: DPI403  , description: Programmatūras risku analīze, credits: 2 
Code: DPI407, description: Modeļvadāmas programmatūras izstrādes praktikums, credits: 2 
Code: HSP430  , description: Sociālā psiholoģija, credits: 2 
Code: HSP483  , description: Industriālās attiecības, credits: 2 
Code: HSP484  , description: Psiholoģija, credits: 2 
Code: DPI002  , description: Maģistra darbs, credits: 20 
Code: DIP002  , description: Maģistra darbs, credits: 20 
Code: DSP002  , description: Maģistra darbs, credits: 20

Including UTF-8!

For the record, here is my [cmd=]locale[/cmd]:
Code:
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_ALL=en_US.UTF-8
 
An XSLT version

Add the following to your table
Code:
<?xml version='1.0' encoding='UTF-8' ?>
<?xml-stylesheet href='XSL' type='text/xsl' ?>
And save the table as "ks-table.xml".

Save the following stylesheet as file 'XSL'
Code:
<?xml version="1.0" encoding='UTF-8' ?>
<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
       
<xsl:output
    method='text'
    omit-xml-declaration='no'
    indent='no'
    standalone='yes'
    doctype-public='-//W3C//DTD XHTML 1.0 Strict//EN'
    doctype-system='http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd'
/>
       
<xsl:template match='/table' >
  <xsl:call-template name='headers' />
  <xsl:call-template name='data' />
</xsl:template>
       
<xsl:template name='headers' >
  <!-- First group of headers -->
  H1=<xsl:value-of select='tr[1]/th[1]' />
  H2=<xsl:value-of select='tr[1]/th[2]' />
  H3=<xsl:value-of select='tr[1]/th[3]' />
  <!-- Second group of headers -->
  G1=<xsl:value-of select='normalize-space(tr[2]/td[1])' />
  G2=<xsl:value-of select='tr[2]/td[2]' />
  G3=<xsl:value-of select='tr[2]/td[3]' />
     <xsl:text>
</xsl:text>
</xsl:template>
       
<xsl:template name='data'>
   <xsl:for-each select='/table/tr' >
      <xsl:if test='td[1][a]' >
         <xsl:text>
</xsl:text>
         <xsl:value-of select='normalize-space(td[1][a])' />
         <xsl:text>|</xsl:text>
         <xsl:value-of select='normalize-space(td[2][a])' />
         <xsl:text>|</xsl:text>
         <xsl:value-of select='normalize-space(td[3])'  />
      </xsl:if>
    </xsl:for-each>
</xsl:template>
       
</xsl:stylesheet>
Point firefox to "/home/killasmurg/ks-table.xml" and you should see:
Code:
  H1= Kods 
  H2= Priekšmets 
  H3= Kredītpunkti 
  
  G1=A
  G2= Programmas obligātie priekšmeti 
  G3= 18 

DIP403|Programmatūras plānošana un metroloģija|3
DSP404|Informācijas sistēmu metodoloģijas|3
DIP496|Datortīklu operētājsistēmas|3
DSP403|Lielu datu bāzu administrēšana|3
DPI453|PROLOG un loģiskā programmēšana|3
DPI364|Sadalītu datu apstrāde datoru tīklos|3
DSP425|Intelektuālu sistēmu projektēšanas metodes|3
DSP427|Datorsistēmu projektēšanas rīku komplekts|3
DIP418|Lietišķās intelektuālās sistēmas|3
DIP419|Ievads risinājumu datorizētā apstrādē|3
DPI458|Lietišķā datorgrafika|3
DPI461|HTML valoda|3
DSP406|Projektu kvalitātes vadība|3
DSP409|Multibāzu sistēmas|3
DIP497|Informācijas aizsardzība|3
DIP498|Lietotāja adaptīvā interfeisa programmatūra|3
DPI457|Vizuālā programmēšana|3
DPI459|Moderno programmēšanas valodu praktikums|3
DSP410|CASE rīki datu bāzu projektēšanā|4
DIP404|Programmatūras izveides tehnoloģiskie līdzekļi|4
DPI403|Programmatūras risku analīze|2
DPI407|Modeļvadāmas programmatūras izstrādes praktikums|2
HSP430|Sociālā psiholoģija|2
HSP483|Industriālās attiecības|2
HSP484|Psiholoģija|2
DPI002|Maģistra darbs|20
DIP002|Maģistra darbs|20
DSP002|Maģistra darbs|20
Firefox performs the transformation on the fly. If you want to produce a text file, you will have to use a tool like Xalan-c or Sablotron. There is also a Perl module to do this.
 
A template to generate SQL insert statements
Code:
<xsl:template name='sql'>
   <xsl:for-each select='/table/tr' >
      <xsl:if test='td[1][a]' >
         <xsl:text>
insert into courses (id, description, credits) values</xsl:text>
         <xsl:text> ("</xsl:text> 
         <xsl:value-of select='normalize-space(td[1][a])' />
         <xsl:text>","</xsl:text>
         <xsl:value-of select='normalize-space(td[2][a])' />
         <xsl:text>","</xsl:text>
         <xsl:value-of select='normalize-space(td[3])'  />
         <xsl:text>");</xsl:text>
      </xsl:if>
    </xsl:for-each>
</xsl:template>
The results:
Code:
insert into courses (id, description, credits) values ("DIP403","Programmatūras plānošana un metroloģija","3");
insert into courses (id, description, credits) values ("DSP404","Informācijas sistēmu metodoloģijas","3");
insert into courses (id, description, credits) values ("DIP496","Datortīklu operētājsistēmas","3");
[snip]
insert into courses (id, description, credits) values ("DPI002","Maģistra darbs","20");
insert into courses (id, description, credits) values ("DIP002","Maģistra darbs","20");
insert into courses (id, description, credits) values ("DSP002","Maģistra darbs","20");
 
Thanks for your suggestions, I'll look into them as soon as I can...
Currently I'm very tired and feel exhausted
 
Dear Tired and Exhausted,

The transformation of your table with XSLT can be seen in real-time action by visiting http://silenceisdefeat.com/~j65nko/ks-table.xml with a suitable browser like firefox.

To do this from the command line you need to do something like
Code:
$ sabcmd ks-table.xsl ks-table.xml
sabcmd is part of the Sablotron XSLT processor in ports.See http://www.freshports.org/textproc/sablotron/

I know you are somewhat familiar with XML/XSLT so it shouldn't be too big a problem to hack that xsl file ;)
 

Attachments

  • ks-table.tgz
    1.3 KB · Views: 192
hmmm, finally I see some good use of xml :D
Eh will probably need to add xml and awk to my "must learn when i have time" list
 
J65nko, how to modify your XSL file to also extract link id

for example
Code:
<a href="prg/d?discId=[red]16086[/red]">
in red.... these numbers is very important....

I have 128 files with tables like in link I provided and then about 2.5K files that I can access thought these numbers....

In the and I will need to link all data, and for this I need this magic number....

P.S.
Hope what I say makes sense
 
Here is the modified 'data' template.

Code:
<xsl:template name='data'>
   <xsl:for-each select='/table/tr' >
      <xsl:if test='td[1][a]' >
         <xsl:text>
</xsl:text>
         <xsl:value-of select='normalize-space(td[1][a])' />
         <xsl:text>|</xsl:text>
       
         <!-- with the @href we retrieve the value of the href attribute -->
         <xsl:value-of select='td[1]/a/@href' />
         <xsl:text>|</xsl:text>
                 
         <!-- we use the Xpath 'substring-after' function
              See http://www.w3.org/TR/xpath#section-String-Functions -->
                 
         <xsl:value-of select='substring-after(td[1]/a/@href,"=")' />
         <xsl:text>|</xsl:text>
                 
         <xsl:value-of select='normalize-space(td[2][a])' />
         <xsl:text>|</xsl:text>
         <xsl:value-of select='normalize-space(td[3])'  />
      </xsl:if>
    </xsl:for-each>
</xsl:template>
I used my interpretation of the stepwise refinement strategy as advocated by Niklaus Wirth, father of Pascal and Modula.

So I first retrieve the complete value of the "href" attribute,
in the next step I isolated the "id" with the "substring-after" function as defined in http://www.w3.org/TR/xpath#section-String-Functions

BTW I found the 'awk' solution very smart and interesting.
But as you saw, by using awk you have to write parsing routines and the data retrieval/manipulation functions.

The advantage of using XML/XSLT is that you don't need to write a parser, you get the parser for free. So you only need to worry about the data retrieval stuff.

And you get opportunity to play with a functional language.;)

An excerpt from the result:
Code:
DIP403|prg/d?discId=9498|9498|Programmatūras plānošana un metroloģija|3
DSP404|prg/d?discId=9650|9650|Informācijas sistēmu metodoloģijas|3
DIP496|prg/d?discId=9745|9745|Datortīklu operētājsistēmas|3
DSP403|prg/d?discId=9649|9649|Lielu datu bāzu administrēšana|3
DPI453|prg/d?discId=9559|9559|PROLOG un loģiskā programmēšana|3
DPI364|prg/d?discId=9543|9543|Sadalītu datu apstrāde datoru tīklos|3
 
Back
Top