Download millions of files from an HTTP server

antolap

Active Member

Thanks: 2
Messages: 136

#1
I have to download millions of files from an HTTP/HTTPS server by making a lot of GET requests like:
Code:
GET /ALL/out/19742
GET /ALL/out/19755
GET /ALL/out/19758
GET /ALL/out/19762
GET /ALL/out/19769
GET /ALL/out/19773
GET /ALL/out/19775
GET /ALL/out/19776
GET /ALL/out/19778
I don't want to overload the server, I don't want it to crash because of tons of connections

Is there a way to make multiple GET requests over one connection, so that I download multiple files without reconnecting each time?

I think there's a better way than opening a separate connection for each file like this:
wget -O ALL/out/999927 https://xxxxxxx/ALL/out/999927
wget -O ALL/out/999928 https://xxxxxxx/ALL/out/999928

Can you help me?
 

ekingston

Active Member

Thanks: 39
Messages: 144

#2
It is called a persistent connection (it's part of the HTTP protocol). I have no idea how to set one up with wget, or how to make use of it if you can.

If the files are your typical text files, compression (which is supported by most web servers) is also a good idea. This is another thing that can be negotiated between the client and the server. I also have no idea how to make this happen with wget.

Sorry I can't be of more help.
 

Snurg

Aspiring Daemon

Thanks: 250
Messages: 705

#3
Why not make a script that keeps a fixed number of wget processes running until all files have been pulled?
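A minimal sketch of that idea, assuming the file IDs are listed one per line in a local file. The function name `fetch_pool`, the variables `BASE_URL` and `FETCH`, and the file name `ids.txt` are placeholders made up for illustration. Note that each wget process here still opens its own connection, so this only caps concurrency; it does not reuse connections.

```shell
#!/bin/sh
# Sketch only: BASE_URL and FETCH are illustrative placeholders.
# FETCH defaults to wget but can be set to "echo" for a dry run.
BASE_URL="${BASE_URL:-https://xxxxxxx/ALL/out}"
# -x recreates the URL path locally, -nH drops the host name directory,
# so https://xxxxxxx/ALL/out/19742 is saved as ALL/out/19742.
FETCH="${FETCH:-wget -q -x -nH}"

# fetch_pool IDS_FILE WORKERS
# Runs at most WORKERS download processes at the same time.
fetch_pool() {
    ids_file=$1
    workers=$2
    # Prefix every ID with the base URL, then let xargs start one
    # process per URL (-n 1), keeping $workers of them running (-P).
    # -P is a GNU/BSD xargs extension, not POSIX.
    sed "s|^|$BASE_URL/|" "$ids_file" | xargs -n 1 -P "$workers" $FETCH
}
```

Something like `fetch_pool ids.txt 4` would then keep four downloads going at once; a small worker count is the safer choice if the goal is to stay gentle on the server.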
 

antolap

Active Member

Thanks: 2
Messages: 136

#5
"It is called a persistent connection (it's part of the HTTP protocol). I have no idea how to set one up with wget, or how to make use of it if you can."
Yes, I'm interested in this.
I can use any program: curl, wget, or anything else.
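Since curl is an option: as far as I know, curl reuses a connection when it is given several URLs for the same host in a single invocation, and it can read those URLs from a config file. Below is a rough sketch that turns a list of IDs into such a config file. The function name `make_curl_config` and the file names `ids.txt` and `batch.cfg` are invented for the example.

```shell
#!/bin/sh
# make_curl_config BASE_URL < ids
# Reads one ID per line on stdin and prints a curl config file in which
# every "url" entry is paired with the "output" entry that follows it.
make_curl_config() {
    base=$1
    while IFS= read -r id; do
        [ -n "$id" ] || continue                 # skip blank lines
        printf 'url = "%s/%s"\n' "$base" "$id"
        printf 'output = "ALL/out/%s"\n' "$id"   # local path for this ID
    done
}

# Possible usage (hypothetical file names, not run here):
#   make_curl_config https://xxxxxxx/ALL/out < ids.txt > batch.cfg
#   curl --fail --silent --show-error --create-dirs \
#        --limit-rate 750k --config batch.cfg
```

Splitting the ID list into chunks and generating one config file per chunk keeps each curl run to a manageable size.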
 

CraigW

New Member

Thanks: 18
Messages: 14

#7
Something I (or perhaps a friend *wink*) once used while being gentle to a website.

Might make a starting point...

wget --user-agent='Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)' --limit-rate=750k -x -r -l2 -np -c --random-wait=on --wait=2 https://XXXXXXX/1/items/usgs_drg_il_37089_a4/*
 

antolap

Active Member

Thanks: 2
Messages: 136

#8
"By default, wget uses persistent connections; no special option is necessary."
But if I run wget file1, then wget file2, then wget file3, I suppose a new connection is opened for each file.
By the time wget file2 is executed, wget file1 has finished, so it has already closed its connection.
 

usdmatt

Daemon

Thanks: 419
Messages: 1,210

#9
Try passing somewhere between 10 and 50 files in the same wget command line. It should use the same connection automatically if multiple files are from the same host. (Looking at the man page, you can also specify a file to read the URLs from, although I'd be hesitant to just give it a file with 1 million entries.)

The method suggested by CraigW seems good but I can't quite see how it knows what files are available unless directory indexes are enabled?

When passing multiple files on the command line it looks like you might have to specify wget full-url1 full-url2 full-url3. Would be nice if you could do something like wget --base=http://website/some/path/ file1 file2 file3 but I can't find anything like that. You may need to watch the maximum command line length (not sure what that is off the top of my head).
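The batching idea can be sketched with xargs, which builds the full URLs, hands a fixed number of them to each wget invocation, and also keeps each command line under the system length limit. The function name `fetch_in_batches` and the variables `BASE_URL` and `FETCH` are assumptions for illustration, as is the `ids.txt` file holding one file name per line.

```shell
#!/bin/sh
# Sketch only: BASE_URL and FETCH are illustrative placeholders.
BASE_URL="${BASE_URL:-http://website/some/path}"
# --wait=1 pauses one second between retrievals inside each batch.
FETCH="${FETCH:-wget -q -x -nH --wait=1}"

# fetch_in_batches IDS_FILE BATCH_SIZE
# Runs the fetch command once per batch, one batch after another,
# passing up to BATCH_SIZE full URLs as arguments each time.
fetch_in_batches() {
    sed "s|^|$BASE_URL/|" "$1" | xargs -n "$2" $FETCH
}
```

For example, `fetch_in_batches ids.txt 50` would start one wget per 50 URLs. Setting `FETCH=echo` first prints the batches instead of downloading them, which is a cheap way to check the grouping.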
 

usdmatt

Daemon

Thanks: 419
Messages: 1,210

#11
You could probably do wget (options) http://url/path/{file1,file2,file3,file4,etc} and let the shell expand it for you. It still ends up running a long command line, though.
 

Ponticelli

New Member

Thanks: 1
Messages: 1

#12
You could use a list of URLs and feed that into wget.

-i file
--input-file=file
Read URLs from a local or external file. If - is specified as
file, URLs are read from the standard input. (Use ./- to read from
a file literally named -.)
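As a rough illustration of the stdin form, the URLs can be generated on the fly and piped straight into a single wget process, so no large URL file has to be written first. The helper name `ids_to_urls` and the file `ids.txt` are made up for this sketch.

```shell
#!/bin/sh
# ids_to_urls BASE_URL < ids
# Prints BASE_URL/ID for every non-empty input line.
ids_to_urls() {
    awk -v base="$1" 'NF { print base "/" $1 }'
}

# Possible usage (not run here): one wget process reads all URLs from
# stdin, waits a second between files, and caps its bandwidth.
#   ids_to_urls https://xxxxxxx/ALL/out < ids.txt |
#       wget -x -nH --wait=1 --limit-rate=750k -i -
```

Because all the requests come from one wget process against one host, this should give it the chance to keep its connection open between files.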