Storing 2,000,000 files on HDD

Hi everybody,

We are using FreeBSD on our servers, and we need to store around 2,000,000 files (around 3,000 files are uploaded per day); each file is around 50 KB. Please suggest how to format the hard drive and which file system we should use to get the best I/O performance, or some other approach (not built on Java); storing the files in the filesystem is preferable.
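(For scale: 2,000,000 × 50 KB ≈ 100 GB of data in total, growing by about 3,000 × 50 KB ≈ 150 MB per day.)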

For now we are using

Code:
/dev/mfid0s1d on /usr (ufs, local, noatime, soft-updates)

and storing files in:
/files/1/2/3/123456.gz
With this directory hierarchy we get around 2,000 files per directory.

I have in mind changing it to:
/files/8/1/d/81dc9bdb52d04dc20036dbd8313ed055.gz
storing the file under md5($filename), which ends up with around 500 files per directory.

Also I am considering this hierarchy:
/files/81/dc/81dc9bdb52d04dc20036dbd8313ed055.gz
which gives about 30 files per directory.
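For illustration, both layouts can be generated the same way; a minimal sketch (Python purely as an example, the helper name is made up):

Code:
import hashlib
import os

def hashed_path(root, name, depth=3, width=1):
    # Hex MD5 of the name, sliced into `depth` chunks of `width` chars
    # that become the directory levels.
    #   depth=3, width=1 -> /files/8/1/d/<md5>.gz
    #   depth=2, width=2 -> /files/81/dc/<md5>.gz
    digest = hashlib.md5(name.encode()).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(depth)]
    return os.path.join(root, *parts, digest + ".gz")

print(hashed_path("/files", "1234", depth=3, width=1))
# /files/8/1/d/81dc9bdb52d04dc20036dbd8313ed055.gz
print(hashed_path("/files", "1234", depth=2, width=2))
# /files/81/dc/81dc9bdb52d04dc20036dbd8313ed055.gz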

How would you store 2,000,000 files?
 
First, I want to say that using hashing is better: it leads to a more even distribution, with nearly the same number of files in each directory. But I'm not sure about MD5; isn't it a slow algorithm? Maybe SHA-1 or something else would be better?
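If in doubt, it's trivial to measure; a throwaway sketch (Python, illustrative only):

Code:
import hashlib
import timeit

name = b"123456.gz"
for algo in ("md5", "sha1"):
    secs = timeit.timeit(lambda a=algo: hashlib.new(a, name).hexdigest(),
                         number=100000)
    print("%s: %.3f s for 100,000 hashes" % (algo, secs))

Either way, at ~3,000 uploads per day the hashing cost is noise next to the disk I/O.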

gege said:
/files/81/dc/81dc9bdb52d04dc20036dbd8313ed055.gz
which will be 30 files in directory.
There will be nearly 1,296 subdirs in "/files/" and about 1,296 in each "81"-style directory (a-z, 0-9 = 36 chars, 36^2 = 1,296). AFAIK you can get very bad performance on UFS when there are too many entries in a directory (correct me if I'm wrong). So I suggest the previous schema is better:
gege said:
/files/8/1/d/81dc9bdb52d04dc20036dbd8313ed055.gz
I guess with this you can get:
Code:
0 lvl ("/files/")       = 36 dirs
1 lvl ("/files/8/")     = 36 dirs each
2 lvl ("/files/8/1/")   = 36 dirs each
3 lvl ("/files/8/1/d/") = 2M / 36^3 = 2M / 46,656 ≈ 42 files per directory
So I think it's the best schema in your case.
 
Pushrod said:
I bet using SQL would be way faster, and would not require computing hashes or other slow operations.
Hashes are really computed only on upload, so it's not a big performance loss. On the other hand, BLOBs are not very fast, and if those files must be downloadable it's a great idea to store the data as plain files. On FreeBSD you can use sendfile(2) with nginx or another web server, which gives you a big speed advantage because you don't need CGI scripts or the like.
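For example, a minimal nginx snippet along those lines (purely illustrative; the server name is made up):

Code:
server {
    listen      80;
    server_name files.example.com;   # hypothetical
    root        /files;              # the directory tree discussed in this thread

    location / {
        sendfile   on;    # kernel copies file -> socket via sendfile(2)
        tcp_nopush on;    # send headers and file data in full packets
    }
}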
 
If you are using UFS, keep an eye on the UFS dirhash buffer.

Code:
# sysctl vfs.ufs | grep hash
vfs.ufs.dirhash_reclaimage: 5
vfs.ufs.dirhash_lowmemcount: 247
vfs.ufs.dirhash_docheck: 0
vfs.ufs.dirhash_mem: 1205838
vfs.ufs.dirhash_maxmem: 2097152
vfs.ufs.dirhash_minsize: 2560

If vfs.ufs.dirhash_mem reaches vfs.ufs.dirhash_maxmem, just increase the maxmem.
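You can raise it on the fly (the value below is just an example; size it to your memory):

Code:
sysctl vfs.ufs.dirhash_maxmem=8388608

Add vfs.ufs.dirhash_maxmem=8388608 to /etc/sysctl.conf to keep it across reboots.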
 
Thanks for the answers.

I was storing these files in MySQL before, and I also considered MongoDB and the like, but everything has its own cons. I want to store them as files; they should be served by lighttpd (I am not using nginx yet).

OK, so we will use UFS as the filesystem; any hints on block size and other tuning?

For storing many files across directories, I know hashing is best; MD5 should be fine, and should even be faster than SHA-1, though the overhead is minimal either way. IDs would also do the job: for example, for the file "12345.gz" I can save it as /files/5/4/3/12345.gz (digits reversed, since filenames are autoincrement IDs and the low digits change fastest, spreading files evenly).

So I am considering two variants:

1. /files/81/dc/81dc9bdb52d04dc20036dbd8313ed055.gz
This will be 256 dirs, each containing 256 dirs, so 256^2 = 65,536 leaf directories and about 30 files per directory. I am not sure whether that is too many directories. I think, and here I need help, that the optimal (or maximum reasonable) number of files per directory without degrading performance is around 256.

vs

2. /files/8/1/d/81dc9bdb52d04dc20036dbd8313ed055.gz
This will be 16^3 = 4,096 leaf directories, about 488 files per directory, which is IMO quite high.

Now I have another idea, without hashing:
File 123456.gz would be stored as /files/6/5/4/3/123456.gz, which ends up at 10^4 = 10,000 directories and about 200 files per directory; that seems optimal and saves the hashing overhead.
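A sketch of that ID-based layout (Python, purely illustrative):

Code:
import os

def id_path(root, file_id, levels=4):
    # Reversed digits: the fast-changing low digits come first, so
    # consecutive autoincrement IDs spread evenly over the top dirs.
    digits = str(file_id).zfill(levels)[::-1]   # pad short IDs, then reverse
    return os.path.join(root, *digits[:levels], "%d.gz" % file_id)

print(id_path("/files", 123456))
# /files/6/5/4/3/123456.gz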

User23, thanks for the dirhash hint.
 
gege said:
1. /files/81/dc/81dc9bdb52d04dc20036dbd8313ed055.gz
this will be 256 dirs, each containing 256 dirs, so 256^2 = 65,536
...
2. /files/8/1/d/81dc9bdb52d04dc20036dbd8313ed055.gz
This will be 16^3 = 4,096 leaf directories, about 488 files per directory, which is IMO quite high.
Where did you get these numbers?

1. 36 variants for each char (a-z, 0-9): 36^2 = 1,296 lvl-1 dirs, and 36^2 under each of those = 1,296 × 1,296 = 1,679,616 lvl-2 dirs. So that is about 1-2 files in each directory.
2. 36 variants for each char, 3 levels of dirs. For 2 million files that is about 42 files per directory. See the calculations above (post #2).
 
Ah, sorry. OK, then it is probably better to use 4 levels of subdirectories (16^4 = 65,536 leaf dirs for a hex hash); that should give about 30 files per dir in that case.
 
Just one more point.

What do you think is better (assuming I will be using IDs, not hashes):

/files/4/3/2/1/1234.gz

or

/files/43/21/1234.gz
 