Solved How to 'cp -a' from a website

balanga · 2026-02-18T10:55:26+0000

I want to copy a subdirectory tree from a website, effectively cp -a remote-host/dir ..

How would I do that?

I can download files individually via my browser but would like to duplicate the remote directory.

SirDice · 2026-02-18T10:56:02+0000

You know scp(1) exists? Other than that, wget(1) can download directories, but DirectoryIndex has to be enabled, there's no way to figure out the contents of a web directory through the regular HTTP(S) protocol.

Isoux · 2026-02-18T11:19:32+0000

You can also use rsync if you have SSH access — it is usually a better equivalent to cp -a over the network:

 

rsync -avz user@remote-host:/path/to/dir .

For HTTP-only access, wget --mirror works, but as mentioned, directory listing must be enabled; otherwise there is no generic way to enumerate files over plain HTTP.

DutchDaemon · 2026-02-18T11:22:31+0000

Furthermore:
net/rclone
net/rsync

eternal_noob · 2026-02-18T11:53:00+0000

https://www.baeldung.com/linux/curl-download-all-files-from-directory

Alain De Vos · 2026-02-18T12:21:26+0000

sshfs + clone

victort · 2026-02-18T13:35:17+0000

Depending on which files you want, you can try this tool.

GitHub - tschettervictor/shcrapy: a tool to collect links from a website

a tool to collect links from a website. Contribute to tschettervictor/shcrapy development by creating an account on GitHub.

github.com

This saves the given file extensions to an output file which you can then input into any downloaded program.

kpedersen · 2026-02-18T13:50:02+0000

I recall httrack being an OK web archiver.

balanga · 2026-02-18T15:46:05+0000

SirDice said:
You know scp(1) exists? Other than that, wget(1) can download directories, but DirectoryIndex has to be enabled, there's no way to figure out the contents of a web directory through the regular HTTP(S) protocol.

It's a public website.

I had forgotten that wget does a recursive retrieval so I did that, but got a ton of html files which I don't want.

Not sure if there is a straightforward way of deleting them all.

balanga · 2026-02-18T15:49:10+0000

kpedersen said:
I recall httrack being an OK web archiver.

I'm not bothered about copying a website.

What I wanted was to just get the files from

Index of /debian/dists/trixie/main/installer-amd64/current/images/netboot

eternal_noob · 2026-02-18T15:49:13+0000

wget has a file reject list you can use. (--reject rejlist)

e.g. --reject jpg,png --accept html

(or the other way round)

GNU Wget 1.25.0 Manual

www.gnu.org

kpedersen · 2026-02-18T17:41:37+0000

balanga said:
I'm not bothered about copying a website.

What I wanted was to just get the files from

Index of /debian/dists/trixie/main/installer-amd64/current/images/netboot

To clarify, that should recursively go through every link and fetch the data. The fact it is a "website" is not quite so important.

It should also have an ftp http mirror: http://ftp.us.debian.org/debian/dists/trixie/main/installer-amd64/current/images/netboot/

Though if this is Debian specifically, I think apt-mirror tends to be a good archiving solution.

Alain De Vos · 2026-02-18T19:19:15+0000

wget -r -np no parent

blackbird9 · 2026-02-18T20:33:09+0000

Lftp is a nice program for mirroring websites. From the manpage:-

"lftp has built-in mirror which can download or update a whole directory
tree. There is also reverse mirror (mirror -R) which uploads or updates
a directory tree on server. Mirror can also synchronize directories be‐
tween two remote servers, using FXP if available."

Lftp runs as an interactive session. See the description of the 'mirror' command in the manpage; basically you mirror a remote directory to a local one.

balanga · 2026-02-18T20:57:34+0000

I have now removed all the gunk from the download and copied all the files onto my PXE server from which I was able to install Debian with little effort.

Having the same facility for FreeBSD would be nice, and I wouldn't be surprised if someone has already put together such a package, although I have not yet come across such a thing.

astyle · 2026-02-18T21:03:22+0000

Adding to what Alain De Vos mentions, the wget manpage also mentions that you can use the -R flag to filter out the stuff you don't need... And wget is available for FreeBSD as ftp/wget...

balanga · 2026-02-18T21:21:13+0000

astyle said:
Adding to what Alain De Vos mentions, the wget manpage also mentions that you can use the -R flag to filter out the stuff you don't need... And wget is available for FreeBSD as ftp/wget...

I always use fetch to retrieve files and forget about wget. It would be nice if fetch could do recursive retrieval

astyle · 2026-02-18T21:32:26+0000

balanga said:
I always use fetch to retrieve files and forget about wget. It would be nice if fetch could do recursive retrieval

fetch(1) doesn't do specifically recursive retrieval on its own. Read up the manpage. You can, however, write a .sh script that implements a recursive retrieval using fetch(1). For recursive retrieval, use wget.

Solved How to 'cp -a' from a website

balanga

SirDice

Administrator

Isoux

DutchDaemon

Administrator

eternal_noob

Alain De Vos

victort

GitHub - tschettervictor/shcrapy: a tool to collect links from a website

kpedersen

balanga

balanga

eternal_noob

GNU Wget 1.25.0 Manual

kpedersen

Alain De Vos

blackbird9

balanga

astyle

balanga

astyle