Web-page content grabber [Archive] - The FreeBSD Forums


Oko
November 5th, 2009, 04:58
Hi All,

My daughter would like to grab/download the source code of a web page she likes. She wants the HTML, CSS, scripts and everything else downloaded as separate files. For instance, if the page is index.html and the layout is determined by style.css, she wants to grab both files onto her local machine.

Can that be done automatically? I am familiar with the Firebug add-on as well as various Perl scripts that can grab the HTML content.
She claims that she has heard of some kind of software that can do that.

Cheers,
OKO

DutchDaemon
November 5th, 2009, 05:16
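Firefox? "Save Page As" - "Web Page, Complete". Downloads everything the browser receives: html, css, javascript, gifs, jpegs, the lot. (Server-side code such as php is never sent to the browser, only the html it generates.)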
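
For anyone who wants to see which separate files a page such as index.html pulls in, here is a minimal sketch using only the Python standard library. It is not what Firefox does internally; the names `AssetCollector` and `find_assets` and the example.com URLs are made up for illustration. It only looks at stylesheet links, scripts and images, and it does not download anything.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AssetCollector(HTMLParser):
    """Collect absolute URLs of stylesheets, scripts and images on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # URL the HTML was fetched from
        self.assets = []          # absolute URLs, in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        ref = None
        if tag == "link" and "stylesheet" in (attrs.get("rel") or "").lower():
            ref = attrs.get("href")   # e.g. style.css
        elif tag in ("script", "img"):
            ref = attrs.get("src")    # inline scripts have no src and are skipped
        if ref:
            # Resolve relative references against the page URL.
            self.assets.append(urljoin(self.base_url, ref))


def find_assets(html, base_url):
    """Return the asset URLs referenced by the given HTML text."""
    parser = AssetCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.assets
```

Each URL in the returned list could then be fetched and saved next to the HTML file, which is roughly the "separate files" result asked for above.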

Oko
November 5th, 2009, 06:10
Dutch,

Thank you so much for your prompt answer. I will ask you one more stupid question. Is there a way that she can pull out the content of the "whole web-site"? By "whole web-site" she means all the web pages that are directly linked to the initial web page and reside on the same web server.
In other words, my daughter wants to recreate somebody's elaborate web-site on her local computer.

Sorry, I promise this is the last one. It took me half an hour to understand what she was talking about because to me it looked like an ill-posed problem involving infinite recursion.

DutchDaemon
November 5th, 2009, 06:20
Ah, you will venture into the domain of web crawlers. There are some in the ports tree. I've not used any of them, but they basically do the same thing (I'm talking from experience based on some Windows-based crawl clients (Telenet, or something?) here, long ago).

They will usually allow you to specify the URL to start crawling from, the depth to crawl, whether or not to leave the website (domain) when crawling, the type of files to download, the size of files to download (min/max size), and even whether or not to honour the robots.txt exclusions file.
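
To make those options concrete, here is a rough sketch of the kind of test a crawler might apply to each link before downloading it. It is not taken from any of the ports listed in this thread; the function name `in_scope`, its parameters and the example.com URLs are all hypothetical. It covers crawl depth, staying on the starting host, file types and robots.txt, but leaves out the min/max file size limits.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def in_scope(url, start_url, depth, max_depth,
             allowed_exts=None, robots=None, agent="*"):
    """Decide whether a link found at the given depth should be fetched."""
    target, start = urlparse(url), urlparse(start_url)
    if depth > max_depth:
        return False                      # too many hops from the start page
    if target.netloc != start.netloc:
        return False                      # link leaves the starting host
    if allowed_exts is not None:
        name = target.path.rsplit("/", 1)[-1]
        # Paths without a dot get the empty extension "".
        ext = name.rsplit(".", 1)[-1].lower() if "." in name else ""
        if ext not in allowed_exts:
            return False                  # file type not wanted
    if robots is not None and not robots.can_fetch(agent, url):
        return False                      # excluded by robots.txt
    return True
```

A `RobotFileParser` can be filled from the site's robots.txt with `set_url()` and `read()`, or from text already in hand with `parse()`. Passing `robots=None` corresponds to the "do not honour robots.txt" setting mentioned above.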

I have no idea how user-friendly these programs are (GUI, console), so I'll just point you to them and leave it up to you to .. crawl through them, I guess.

www/crawl -- looks like 'pics only', maybe adjustable
www/larbin -- search engine-type crawler, maybe more 'indexing' than 'downloading/mirroring'
www/webcrawl -- seems to fit the bill best (total site copy)
www/scloader -- not much info, regular crawler/downloader
www/momspider -- looks a bit complex and 'academic'

vermaden
November 5th, 2009, 11:02
@Oko

Also check things like Opera Dragonfly and Firefox Firebug.

Beastie
November 5th, 2009, 11:04
Anything wrong with ftp/wget?

Don't be deceived by the ftp/, it works for HTTP too. Plus it's very easy to use... if one bothers to read the fabulous manual, of course.

e.g.:
wget --continue --no-parent --recursive --convert-links --progress=bar --tries=10 --exclude-directories=excl,uded,dir,ecto,ries http://www.website.com/directory/index.html

SPlissken
November 5th, 2009, 14:27
Personally, I use this one: httrack
http://www.freebsdsoftware.org/www/httrack.html

Oko
November 5th, 2009, 18:41
Anything wrong with ftp/wget?

Don't be deceived by the ftp/, it works for HTTP too. Plus it's very easy to use... if one bothers to read the fabulous manual, of course.

e.g.:
wget --continue --no-parent --recursive --convert-links --progress=bar --tries=10 --exclude-directories=excl,uded,dir,ecto,ries http://www.website.com/directory/index.html

Can wget download style.css files? I was experimenting yesterday with wget but the man pages are long and I was not sure if it can grab things recursively.

Oko
November 5th, 2009, 18:42
www/webcrawl -- seems to fit the bill best (total site copy)

It is not ported to OpenBSD so I quickly created a port. Guess what? I keep getting a core dump. I have not had a chance to debug the code but it looks like it might have a serious problem. I also noticed that it is not ported to NetBSD.

Oko
November 5th, 2009, 18:46
@Oko

Also check things like Opera Dragonfly and Firefox Firebug.
Thanks Vermaden! I know about Firebug but she wants to snap the whole web-site from the server and recreate things locally.
I have not used Opera Dragonfly before. Yesterday I played with it a little bit. We do all sorts of things for our kids:-)

Beastie
November 5th, 2009, 23:00
Can wget download style.css files? I was experimenting yesterday with wget but the man pages are long and I was not sure if it can grab things recursively.
Sure. It's usually (like under almost every GNU/Linux distro) used to download single files only, kinda like fetch, but with recursion enabled (-r or --recursive), it follows links and downloads *every single file* it can reach, just like any other web crawler. By default it stops five levels deep; -l or --level changes that.
Check the example from my last post and read the man description for the options I used. They are the most useful. You can also exclude specific extensions with -R or --reject if you want.
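
If the command gets run often, it can help to assemble it in a small script. The sketch below is only one possible way to do that; `build_wget_args` is a made-up helper, not part of wget. It reuses the options from the earlier example and adds --page-requisites, which was not in that command and which asks wget to also fetch the stylesheets and images each page needs.

```python
def build_wget_args(url, reject=None, exclude_dirs=None, tries=10):
    """Return a wget command line as a list of arguments (nothing is run)."""
    args = [
        "wget",
        "--recursive",        # follow links
        "--no-parent",        # never climb above the starting directory
        "--convert-links",    # rewrite links so the local copy is browsable
        "--page-requisites",  # also fetch CSS, images etc. (added here)
        "--continue",         # resume partially downloaded files
        f"--tries={tries}",
    ]
    if reject:
        # Comma-separated suffixes to skip, e.g. avi,iso
        args.append("--reject=" + ",".join(reject))
    if exclude_dirs:
        args.append("--exclude-directories=" + ",".join(exclude_dirs))
    args.append(url)
    return args
```

The resulting list can be handed to `subprocess.run(args, check=True)` on a machine where wget is installed, or joined with spaces and pasted into a shell.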



@DutchDaemon: why have you edited my last post? Not that I have any problem with it, but the command should be a single line.

drhowarddrfine
November 6th, 2009, 02:14
SPlissken, above, had the right program. Use httrack, though I do like wget too.

DutchDaemon
November 6th, 2009, 02:15
@Beastie - Long code lines tend to run off of the page, but it wasn't so bad here, so I changed it back.
