How to Get Web Page Source?

iic2 · May 16, 2009

I've been looking into Perl, and even wondering how to do this in asm, on a FreeBSD box running Apache, and now I don't think it needs all of that serious coding other than for parsing the text to extract the needed information. I thought it was about spiders, robots and bots, something magical. But all I need is a separate timed process or timed script to grab a simple web page, complete and TEXT-ONLY, without the use of a browser... No HTML rendering. Go get the page and save that text to disk at 12:01am each and every night. And that's it.

Example:
http://www.wunderground.com/US/IN/Bloomington.html

After I grab the complete page source, I'll compare it against the previous 5-Day Forecast for ZIP Code 47401, extract that info, and update my page with the same forecast nightly. I would also include a link to this public information site on my page as a matter of respect.

The hard stuff comes next. I'll worry about how to parse it for changes in the sky later. Just getting the text on the disk is what I need to do right now. It may be simple but I've never done it before.

Thanks in advance

DutchDaemon · May 16, 2009

fetch(1)
ftp/curl

vivek · May 16, 2009

You can use the above tools in combination with the shell & friends. Another, recommended option is Perl's LWP browser:

Code:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

my $ua = LWP::UserAgent->new;
#$ua->agent("$0/0.1 " . $ua->agent);
# pretend we are a very capable browser
$ua->agent("Mozilla/8.0");
my $req = HTTP::Request->new(GET => 'http://www.wunderground.com/US/IN/Bloomington.html');
$req->header('Accept' => 'text/html');

my $res = $ua->request($req);

# check the outcome
if ($res->is_success) {
    print $res->content;
    # or do something else
    if ($res->content =~ m/zip-code/i) {
        # do-something
    }
} else {
    print "Error: " . $res->status_line . "\n";
}

Another one-liner:

Code:

perl -MLWP::Simple -e "getprint 'http://example.com/page.html'" | grep -E "something|regex" | do_something

See http://www.perl.com/pub/a/2002/08/20/perlandlwp.html

Mel_Flynn · May 16, 2009

This is easy with PHP plus the pcre and curl modules, and it can run from cron as a CLI script, so you don't need it in Apache. Apache can just SSI the file, or you can write the output as JSON in the webroot and then use JavaScript to display the info.
The only drawback, no matter which implementation you choose, is that you need to maintain the code as the target site's HTML changes. If the target site provides an RSS service that suits you, you will make your life a lot easier.
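
To illustrate the RSS route, here is a rough Python sketch (not PHP, and not tied to any particular site) that pulls the items out of an RSS 2.0 feed and dumps them as JSON. The sample feed, its titles and the function names are made up for the example; a real feed will carry different fields, so treat this as a starting point only.

```python
import json
import xml.etree.ElementTree as ET

# Made-up sample feed; a real feed's titles and descriptions will differ.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example forecast feed</title>
    <item>
      <title>Saturday</title>
      <description>Partly cloudy. High 70F.</description>
    </item>
    <item>
      <title>Sunday</title>
      <description>Chance of rain. High 65F.</description>
    </item>
  </channel>
</rss>
"""


def rss_items(xml_text):
    """Return a list of {'title', 'description'} dicts, one per RSS <item>."""
    root = ET.fromstring(xml_text)
    return [
        {
            "title": item.findtext("title", default=""),
            "description": item.findtext("description", default=""),
        }
        for item in root.iter("item")
    ]


def rss_to_json(xml_text):
    """Serialize the feed items as JSON text (e.g. to write into the webroot)."""
    return json.dumps(rss_items(xml_text), indent=2)
```

Because a feed is structured XML, nothing here depends on the page layout, which is the maintenance advantage mentioned above.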

mwatkins · May 17, 2009

Welcome to the world-wide-web of screen scraping. Since in your case you want to do targeted scraping, you really want to use some sort of DOM or other XML/XHTML query facility. Might sound difficult and esoteric but believe me, it'll be easier for you in the long run and I'll post a couple examples here to prove it.

If you wish to collect specific data off one or more pages, process it, and then publish something new from it, you are an ideal candidate for some very effective Python-based tools.

Getting the source is dirt easy (all code samples are Python, I'm using PHP highlighting for nicer reading only):

PHP:

from urllib2 import urlopen
data = urlopen('http://www.wunderground.com/US/IN/Bloomington.html').read()
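
For anyone on Python 3, where urllib2 became urllib.request, a minimal sketch of the original request (grab the page, save it to disk, and notice whether it changed since the last run) might look like the following. The function names, the user-agent string and the output file name are assumptions for illustration, not anything the site requires.

```python
from pathlib import Path
from urllib.request import Request, urlopen


def fetch(url, timeout=30):
    """Return the raw bytes served at url (no rendering, just the source)."""
    # The user-agent string is an arbitrary placeholder, not a required value.
    req = Request(url, headers={"User-Agent": "nightly-fetch/0.1"})
    with urlopen(req, timeout=timeout) as resp:
        return resp.read()


def save_if_changed(url, path):
    """Fetch url and write it to path; return True if the content changed."""
    path = Path(path)
    new = fetch(url)
    old = path.read_bytes() if path.exists() else None
    if new == old:
        return False  # identical to the previous copy, leave the file alone
    path.write_bytes(new)
    return True


# Example call (hypothetical output file name):
# save_if_changed("http://www.wunderground.com/US/IN/Bloomington.html",
#                 "bloomington.html")
```

Saved as a script that makes the call above, this can run from cron at 12:01am with an entry along the lines of `1 0 * * * /usr/local/bin/python3 /path/to/fetch_page.py` (both paths are hypothetical). Note that a byte-for-byte comparison will report a change whenever anything on the page differs, including timestamps, so comparing the extracted forecast is the more useful check once parsing is in place.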

Want to parse it? How solid are your (X)HTML skills? CSS? You'll need some experience there to make short work of it -- the tools are simple though if you take an approach similar to the JavaScript jQuery library:

PHP:

from pyquery import PyQuery
from lxml.html import fromstring, tostring, parse

VAN_WX = 'http://www.weatheroffice.gc.ca/city/pages/bc-74_metric_e.html'

def get_weather(wx=VAN_WX, image_only=True):
    assert wx
    xq = xquery(url=wx)
    if image_only:
        return str(xq("#currentcond-left img"))
    else:
        return str(xq("#currentcond"))

def xquery(url=None, file=None, content=None, relative_links=False, **kwargs):
    """
    A factory producing a PyQuery object; unless or until PyQuery adds a
    make_links_absolute feature, this saves me some typing. kwargs is passed on
    to the PyQuery object, without any filename and url items. See the PyQuery
    and lxml apis.

    Default argument(s):
        relative_links : False (convert all to absolute)
    """
    if url or file:
        doc = parse(url or file, **kwargs).getroot()
    elif content:
        doc = fromstring(content, **kwargs)
    else:
        raise ValueError('Must specify url, file, or content')
    if not relative_links and doc.base_url:
        doc.make_links_absolute()
    return PyQuery(tostring(doc))

if __name__ == '__main__':
    print(get_weather(image_only=False))

The above is working code I hacked together following a few minutes of playing with lxml and PyQuery; if you have Python, PyQuery and lxml installed, it'll run.

The wunderground site appears to have lots of detail in the HTML attributes - I've no doubt that you'll be able to pull out the content easily enough... oh, darn, it's so darn easy. Make the xquery function above importable or paste it into a Python session and do this:

PHP:

$ python
>>> from mylib.xquery import xquery
>>> url = 'http://www.wunderground.com/US/IN/Bloomington.html'
>>> x = xquery(url=url)
>>> for a in x('#curcondbox span.pwsrt'):
...   print a.values()
['pwsrt', 'KINBLOOM20', 'metric', 'lu', '1242521331']
['pwsrt', 'KINBLOOM20', 'metric', 'tempf', u'\xb0F', u'\xb0C', '58.1']
['pwsrt', 'KINBLOOM20', 'metric', 'humidity', '', '', '51']
['pwsrt', 'KINBLOOM20', 'metric', 'dewptf', u'\xb0F', u'\xb0C', '40']
['pwsrt', 'KINBLOOM20', 'metric', 'windspeedmph', 'mph', 'km/h']
['pwsrt', "window.wind_animate['CONDBOXWIND']", 'KINBLOOM20', 'metric', 'winddir', '', '', 'SSW']
['pwsrt', 'KINBLOOM20', 'metric', 'windgustmph', 'mph', 'km/h']
['pwsrt', 'KINBLOOM20', 'metric', 'baromin', 'in', 'hPa', '29.29']

Easy enough?

Invest some time in learning Python. It'll repay you many times over.

iic2 · May 17, 2009

mwatkins, you explain this so well. I just completed my first full year of Web Programming using the Deitel textbook, plus SQL classes. I got straight A's so I do understand, but it'll be another year before we get to see anything close to this. And another year just to get to UNIX.

This is what I call real Web Programming. By September of this year be sure to re-check this thread to see what I've done with your code. You're going to love it. Now I've got a chance to sleep through two years of classes, because everything I need to learn is here and I'll have it down pat by next semester. I'm so excited.

Thank you sooo MUCH
Time to go to work

Sorry guys, I forgot to say thanks to you also. I'm not overlooking anything. I understand now. I'll be building my tool box for weeks. I finally get it now.

http://effbot.org/zone/element-index.htm

mwatkins · May 18, 2009

You are very welcome.

Glad to see you found the effbot - his site is very useful, and Fredrik's ElementTree is part of the Python distribution now. I've used both ET and lxml - I'm struggling at this moment to remember why I've been using lxml more lately, but it may be as simple as the fact that PyQuery makes some tasks ridiculously easy. All these tools have their place and as you work with them you'll find where they all fit best.
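
For a taste of ElementTree straight from the standard library, here is a small sketch that pulls values out of a well-formed fragment. The markup, the data-high attribute and the function name are all invented for the example. One practical difference worth knowing (offered as background, not as the reason mentioned above) is that ElementTree expects well-formed XML, while lxml's HTML parser is more forgiving of real-world pages.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment; real pages are rarely this tidy.
SNIPPET = """
<div id="forecast">
  <span class="day" data-high="70">Saturday</span>
  <span class="day" data-high="65">Sunday</span>
  <span class="note">Updated nightly</span>
</div>
"""


def daily_highs(xml_text):
    """Map each day name to its high temperature (as an int)."""
    root = ET.fromstring(xml_text)
    return {
        span.text: int(span.get("data-high"))
        # ElementTree supports a limited XPath subset, including [@attr='value'].
        for span in root.findall(".//span[@class='day']")
    }
```

The attribute predicate in findall plays roughly the same role as the CSS selector in the PyQuery examples, just with less sugar.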
