Web-page content grabber

Hi All,

My daughter would like to grab/download the source code of a web page she likes. She wants the HTML, CSS, scripts, and everything else downloaded as separate files. For instance, if the page is index.html and its layout is determined by style.css, she wants to grab both files onto the local machine.

Can that be done automatically? I am familiar with the Firebug add-on as well as various Perl scripts that can grab the HTML content.
She claims she has heard of some kind of software that can do that.

Cheers,
OKO
 
Firefox? "Save Page As" - "Web Page, Complete". Downloads everything, html, php, css, javascript, gifs, jpegs, the lot.
 
Dutch,

Thank you so much for your prompt answer. I will ask you one more stupid question. Is there a way that she can pull out the content of the "whole web-site"? By "whole web-site" she means all the web pages that are directly linked to the initial web page and reside on the same web server.
In other words, my daughter wants to recreate somebody's elaborate web site on her local computer.

Sorry, I promise this is the last one. It took me half an hour to understand what she was talking about, because to me it looked like an ill-posed problem involving infinite recursion.
 
Ah, you will venture into the domain of web crawlers. There are some in the ports tree. I've not used any of them, but they basically do the same thing (I'm talking from experience based on some Windows-based crawl clients (Telenet, or something?) here, long ago).

They will usually allow you to specify the URL to start crawling from, the depth to crawl, whether or not to leave the website (domain) when crawling, the type of files to download, the size of files to download (min/max size), and even whether or not to honour the robots.txt exclusions file.

I have no idea how user-friendly these programs are (GUI, console), so I'll just point you to them and leave it up to you to .. crawl through them, I guess.

www/crawl -- looks like 'pics only', maybe adjustable
www/larbin -- search engine-type crawler, maybe more 'indexing' than 'downloading/mirroring'
www/webcrawl -- seems to fit the bill best (total site copy)
www/scloader -- not much info, regular crawler/downloader
www/momspider -- looks a bit complex and 'academic'
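
For a rough idea of how those options look on a command line: the same knobs exist as flags in tools like GNU wget. This is only a sketch, with the URL, depth and file types purely placeholders:

Code:
# start at the given URL, stay on that host (the default), go 3 levels deep,
# keep only HTML/CSS/image files, and honour robots.txt (also the default)
wget --recursive --level=3 --no-parent \
     --accept=html,htm,css,png,jpg,gif \
     http://www.example.com/section/

Dropping --accept downloads every file type, and adding -e robots=off would ignore the robots.txt exclusions mentioned above.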
 
Anything wrong with ftp/wget?

Don't be deceived by the ftp/, it works for HTTP too. Plus it's very easy to use... if one bothers to read the fabulous manual, of course.

e.g.:
Code:
wget --continue --no-parent --recursive --convert-links --progress=bar --tries=10 --exclude-directories=excl,uded,dir,ecto,ries http://www.website.com/directory/index.html
 
Beastie said:
Anything wrong with ftp/wget?

Don't be deceived by the ftp/, it works for HTTP too. Plus it's very easy to use... if one bothers to read the fabulous manual, of course.

e.g.:
Code:
wget --continue --no-parent --recursive --convert-links --progress=bar --tries=10 --exclude-directories=excl,uded,dir,ecto,ries http://www.website.com/directory/index.html

Can wget download style.css files? I was experimenting with wget yesterday, but the man page is long and I was not sure whether it can grab things recursively.
 
DutchDaemon said:
www/webcrawl -- seems to fit the bill best (total site copy)
It is not ported to OpenBSD, so I quickly created a port. Guess what? I keep getting a core dump. I have not had a chance to debug the code, but it looks like it might have a serious problem. I also noticed that it is not ported to NetBSD either.
 
vermaden said:
@Oko

Also check things like Opera Dragonfly and Firefox Firebug.
Thanks, vermaden! I know about Firebug, but she wants to copy the whole web site from the server and recreate it locally.
I have not used Opera Dragonfly before; yesterday I played with it a little. We do all sorts of things for our kids. :)
 
Oko said:
Can wget download style.css files? I was experimenting yesterday with wget but the man pages are long and I was not sure if it can grab things recursively.
Sure. It's usually (like under almost every GNU/Linux distro) used to download single files only, kinda like fetch(1), but with recursion enabled (-r or --recursive) it downloads *every single file* it has access to, just like any other web crawler.
Check the example from my last post and read the man page description for the options I used; they are the most useful. You can also exclude specific extensions with -R or --reject if you want.
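
And to tie it back to the style.css question: with recursion plus --page-requisites, wget also pulls in the stylesheets, images and scripts a page needs to render, and --convert-links rewrites the references so the local copy works offline. A minimal sketch, with the URL just a placeholder:

Code:
# mirror a site for offline viewing: recurse, grab page requisites
# (CSS, images, scripts), and rewrite links for local browsing
wget --recursive --no-parent --page-requisites --convert-links \
     --adjust-extension --tries=10 http://www.example.com/

--adjust-extension simply appends .html (and, in newer wget versions, .css) where the server-side names lack it; leave it out if you prefer the original filenames.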



@DutchDaemon: why have you edited my last post? Not that I have any problem with it, but the command should be a single line.
 
@Beastie - Long code lines tend to run off of the page, but it wasn't so bad here, so I changed it back.
 