download millions of files from http server

antolap · Dec 13, 2017

I have to download millions of files from an HTTP/HTTPS server by making a lot of GET requests like:

Code:

GET /ALL/out/19742
GET /ALL/out/19755
GET /ALL/out/19758
GET /ALL/out/19762
GET /ALL/out/19769
GET /ALL/out/19773
GET /ALL/out/19775
GET /ALL/out/19776
GET /ALL/out/19778

I don't want to overload the server or make it crash because of tons of connections.

Is there a way to make multiple GET requests in one connection so that I can download multiple files at once?

I think there's a better way than opening a separate connection for each file like this:
wget -O ALL/out/999927 https://xxxxxxx/ALL/out/999927
wget -O ALL/out/999928 https://xxxxxxx/ALL/out/999928

Can you help me?

ekingston · Dec 13, 2017

It is called establishing a persistent connection (it's part of the HTTP protocol). I have no idea how to do so with wget, or how to use it if you can.

If the files are your typical text files, compression (which is supported by most web servers) is also a good idea. This is another thing that can be negotiated between the client and the server. I also have no idea how to make this happen with wget.

Sorry I can't be of more help.

Snurg · Dec 13, 2017

Why not make a script that runs a fixed number of wget processes at a time until all files have been pulled?
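
A minimal sketch of that idea, assuming the file IDs are listed one per line in a local file. The names fetch_parallel and ids.txt are made up for illustration, and xargs -P is a GNU/BSD extension rather than POSIX. Each worker still opens its own connection, so the job count is what limits the load on the server:

```shell
#!/bin/sh
# Sketch only: fetch_parallel and ids.txt are hypothetical names.
# Assumes the IDs contain no whitespace, quotes, '|' or '&'.

# Usage: fetch_parallel BASE_URL JOBS [COMMAND...] < ids
# Reads one ID per line on stdin and runs COMMAND once per URL,
# keeping at most JOBS processes alive at the same time.
fetch_parallel() {
    base=$1
    njobs=$2
    shift 2
    # Default downloader: wget, quiet, recreating ALL/out/... locally
    # (-x forces directories, -nH drops the host-name directory).
    if [ "$#" -eq 0 ]; then
        set -- wget -nv -x -nH
    fi
    # Drop blank lines, turn each ID into a full URL, hand one URL
    # to each spawned command.
    sed -e '/^$/d' -e "s|^|$base/|" | xargs -n 1 -P "$njobs" "$@"
}

# Example (not run here):
#   fetch_parallel https://xxxxxxx/ALL/out 4 < ids.txt
```

Passing echo as the command (fetch_parallel https://xxxxxxx/ALL/out 4 echo < ids.txt) prints the URLs instead of fetching them, which is a cheap way to check the list first.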

aragats · Dec 13, 2017

antolap said:
wget -O ALL/out/999927 https://xxxxxxx/ALL/out/999927
wget -O ALL/out/999928 https://xxxxxxx/ALL/out/999928

If you know that all those files (and only those) are in a single directory, the recursive and no-parent options may work:
wget -r -np https://xxxxxxx/ALL/out

antolap · Dec 13, 2017

ekingston said:
It is called establishing a persistent connection (it's part of the HTTP protocol). I have no idea how to do so with wget, or how to use it if you can.

Yes, I'm interested in this.
I can use any program: curl, wget, or anything else.

ljboiler · Dec 13, 2017

By default, wget uses persistent connections; no special option is necessary.

CraigW · Dec 13, 2017

Something I (or perhaps a friend *wink*) once used while being gentle to a website.

Might make a starting point...

wget --user-agent='Mozilla/4.0 (compatible ; MSIE 6.0 ; Windows NT 5.1)' --limit-rate=750k -x -r -l2 -np -c --random-wait=on --wait=2 https://XXXXXXX/1/items/usgs_drg_il_37089_a4/*

antolap · Dec 13, 2017

ljboiler said:
By default, wget uses persistent connections; no special option is necessary.

But if I run wget file1, then wget file2, then wget file3, I suppose there's a new connection for each file.
By the time wget file2 is executed, wget file1 has finished, so it has closed its connection.

usdmatt · Dec 14, 2017

Try passing somewhere between 10 and 50 files in the same wget command line. It should use the same connection automatically if multiple files are from the same host. (Looking at the man page, you can also specify a file to read the URLs from, although I'd be hesitant to just give it a file with 1 million entries.)

The method suggested by CraigW seems good but I can't quite see how it knows what files are available unless directory indexes are enabled?

When passing multiple files on the command line it looks like you might have to specify wget full-url1 full-url2 full-url3. Would be nice if you could do something like wget --base=http://website/some/path/ file1 file2 file3 but I can't find anything like that. You may need to watch the maximum command line length (not sure what that is off the top of my head).
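
One way to get batches of that size without building the command lines by hand is to let xargs do the grouping. This is only a sketch: fetch_batched and ids.txt are hypothetical names, and the default wget flags are one possible choice. xargs also keeps each command line under the system limit, which getconf ARG_MAX reports.

```shell
#!/bin/sh
# Sketch only: fetch_batched and ids.txt are hypothetical names.
# Assumes the IDs contain no whitespace, quotes, '|' or '&'.

# Usage: fetch_batched BASE_URL BATCH [COMMAND...] < ids
# Reads one ID per line on stdin and runs COMMAND with up to BATCH
# full URLs per invocation, one invocation after another.
fetch_batched() {
    base=$1
    batch=$2
    shift 2
    # Default downloader: wget with a 1 second pause between files,
    # saving under ALL/out/... (-x -nH) instead of a host directory.
    if [ "$#" -eq 0 ]; then
        set -- wget -nv -x -nH --wait=1
    fi
    # All URLs in one batch share a host, so a single wget process
    # gets the chance to reuse its connection for them.
    sed -e '/^$/d' -e "s|^|$base/|" | xargs -n "$batch" "$@"
}

# Example (not run here):
#   fetch_batched https://xxxxxxx/ALL/out 50 < ids.txt
#   getconf ARG_MAX    # upper bound on command-line size, in bytes
```

Because the batches run sequentially, there is only one wget process talking to the server at any moment.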

antolap · Dec 14, 2017

usdmatt said:
Would be nice if you could do something like wget --base=http://website/some/path/ file1 file2 file3 but I can't find anything like that.

It would be very useful

usdmatt · Dec 14, 2017

You could probably do wget (options) http://url/path/{file1,file2,file3,file4,etc} and let the shell expand it for you. Still ends up actually running a long command line though.

Ponticelli · Dec 14, 2017

You could use a list of URLs and feed that into wget.

-i file
--input-file=file
Read URLs from a local or external file. If - is specified as
file, URLs are read from the standard input. (Use ./- to read from
a file literally named -.)
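
A rough sketch of building such a list from a file of IDs and feeding it to wget. The names make_url_list, ids.txt and urls.txt are assumptions for illustration, not anything from the man page:

```shell
#!/bin/sh
# Sketch only: make_url_list, ids.txt and urls.txt are hypothetical names.

# Usage: make_url_list BASE_URL < ids > urls
# Reads one ID per line on stdin, skips blank lines, and prints one
# full URL per line on stdout.
make_url_list() {
    base=$1
    awk -v base="$base" 'NF { print base "/" $1 }'
}

# Example (not run here):
#   make_url_list https://xxxxxxx/ALL/out < ids.txt > urls.txt
#   wget -nv -x -nH --wait=1 -i urls.txt
#
# The same thing without a temporary file, reading URLs from stdin:
#   make_url_list https://xxxxxxx/ALL/out < ids.txt | wget -nv -x -nH --wait=1 -i -
```

If a single list with a million entries feels too large, split -l 10000 urls.txt part. would cut it into smaller files that can be passed to wget -i one at a time.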