UFS rsync millions of files inside a directory

I use rsync to create backups of a directory containing 5,000,000 files. Would it be faster if I split the files into 100 directories (50,000 files per directory)? I want to run only one rsync to sync everything, not one per directory.
 
I know, but these images were created by an XML import into a WooCommerce e-shop. I'm trying to find a way to split them into different directories.
 
Note that when you create a UFS filesystem with "newfs" you have the option:
-h avgfpdir
The expected average number of files per directory on the file system.
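For example (a hypothetical invocation; device name and value are only illustrative), if you expect roughly 50,000 files per directory:
Code:
newfs -U -h 50000 /dev/da0p1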


PS: You could split your directory based on the first and second characters of the filenames?
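A minimal sketch of that idea, assuming plain filenames directly under /home/user/files (path and bucket depth are only illustrative):
Code:
#!/bin/sh
# bucket files by the first two characters of their names
cd /home/user/files || exit 1
find . -maxdepth 1 -type f | while read -r path; do
    name=${path#./}
    bucket=$(printf '%s' "$name" | cut -c1-2)
    mkdir -p "$bucket"
    mv -- "$name" "$bucket/"
done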
 
I run rsync, for example, on /home/user/files. If I create /home/user/files/1, /home/user/files/2, /home/user/files/3, etc., split the files across multiple sub-directories, and still run rsync on /home/user/files, will it be the same speed?
 
I believe this topic addresses a very similar question: https://serverfault.com/questions/746551/faster-rsync-of-huge-directory-which-was-not-changed

Another idea: in my experience, syncing with rsync over the network can be quite slow in some situations (like yours). Sometimes, however, if you tar-gzip your files and transfer them as a single archive it is way faster. I ran into this two weeks ago when transferring ~100 GB of Nextcloud files in JPG format. Simply tar-ing the files locally into one big archive decreased the estimated copy time 7-fold.
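As a rough, hypothetical sketch of that approach (host name and paths are placeholders, and the target directory must already exist), packing locally and unpacking on the far side in one pipeline:
Code:
# pack the whole tree, stream it over SSH, unpack remotely
tar -czf - -C /home/user/files . | ssh user@backup.example.com 'tar -xzf - -C /backup/files'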
 
I tested on my machine whether rsync would cope with this. Locally, there was no issue.

Creating the files:
Python:
# create 5,000,000 small text files in the current directory
for k in range(5000000):
    with open('file-' + str(k) + '.txt', 'w') as f:
        f.write('Lorem ipsum dolor sit amet ...')

It took ~5 minutes. No issues.

Rsync:
Bash:
time rsync -ahisvP user/ user2/
It took ~9 minutes, no issues.

If you replace the 'v' with a 'q', it might go even a little faster.

File system: ZFS, raidz1 with 5 HDDs, 1TB each.
 
My experience: the file system isn't the deciding factor - sooner or later you will run into more than one kind of trouble (syncing is one of them - the remote system might be a given, and not FreeBSD…). Such a directory simply isn't manageable. Been there (not my fault).

WooCommerce should solve that. Really. If it doesn't, it's badly designed… (or not ready for your requirements).

One solution: in the end you need the file (maybe an image, possibly with automatically created thumbnails / different sizes) and a related database record. The database record needs a unique identifier ("ID") to access the media file, which can simply be used inside the directory structure. To split the files across different directories you can use a random number generator, in your case maybe from 1000 to 9999. So first you make your database entry and get the new ID; also store a random prefix in a separate table column. Then create your directory like …/prefix/media_id/ and store your file there; if it is an image, create its instances / thumbnails in the same target directory.

Here's an example of uploading / adding a file to a CMS. First generate a random prefix number:
Code:
prefix :: 7358
Make your record in the database:
Code:
id :: 8264837
prefix :: 7358
filename :: my_file.jpg
1920 :: 1920.png
960 :: 960.png
thumbnail :: thumb.jpg
Store your file:
Code:
media/7358/8264837/my_file.jpg
Create your instances:
Code:
media/7358/8264837/1920.png
media/7358/8264837/960.png
media/7358/8264837/thumb.jpg
With such a solution you can handle a really large number of media files - and without having to rewrite filenames on request… (meaning: you can store more than one file named "article.jpg"). That should be basic functionality for a CMS.
 
Forgot the rsync thing…: when you reach a tool's limits, build your own. Make a directory listing on the source (e.g. PHP could read larger directories than rsync or other GNU tools - at least many years ago…) and transfer the list to the target; create a list there (with the same script) and compare the two. The difference is what has to be handled. If you've been using rsync over SSH, you can use e.g. scp for the file transfers. If the source is a webserver you can use fetch, curl or wget both for the file transfers and for generating the host's list: make a script on the target machine that requests the list from the source, then builds its own list, walks through the difference, and fetches (and deletes) files. It's not complicated - the only thing is to find tools / languages that can get past rsync's limit on reading file lists.
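A hypothetical sketch of that list-and-compare approach over SSH (host name and paths are made up; a real script would also have to handle deletions):
Code:
#!/bin/sh
SRC_HOST=user@source.example.com
SRC_DIR=/home/user/files
DST_DIR=/backup/files

# build sorted file lists on both sides
ssh "$SRC_HOST" "cd $SRC_DIR && find . -type f | sort" > source.list
( cd "$DST_DIR" && find . -type f | sort ) > target.list

# files present on the source but missing on the target
comm -23 source.list target.list > missing.list

# fetch only the difference, e.g. with scp
while read -r f; do
    mkdir -p "$DST_DIR/$(dirname "$f")"
    scp "$SRC_HOST:$SRC_DIR/$f" "$DST_DIR/$f" < /dev/null
done < missing.list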
 
Whether a file system or tool can handle huge numbers of entries per directory depends on how it is implemented. For some file systems, 5 million files in a single directory is absolutely trivial, as they are designed to handle a billion entries per directory. Similarly, if one writes tools (such as rsync) carefully, they should scale linearly with the number of files they process at once. In that case, putting all 5 million files into a single directory and operating on them at once is exactly the right thing to do, as it has the lowest complexity, and reduces the overall number of file system objects (files + directories) to the minimum.

The limit might be that tools may have to keep a complete list of files (names and attributes) in memory, and at 5 million files and a reasonable guess of ~200 bytes per object, that means allocating a gigabyte of memory for the process. On a reasonable production system today, that should be possible.

Note that the overall speed will be determined by the combination of file system and tool. Even the most efficient file system on the planet can be brought to its knees by a tool that does things in a bad way, and vice versa. I can easily sketch out a toy version of rsync that is O(n²) in the number of files it has to process, which would be catastrophic on a directory with millions of files.
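For illustration only, here is a deliberately naive shell sketch of such a quadratic "sync" (the directories are made up): for every source file it rescans the entire destination instead of keeping an index.
Code:
#!/bin/sh
# DON'T do this: O(n^2), because the inner loop rescans /dst for every file in /src
for f in /src/*; do
    found=no
    for g in /dst/*; do
        if [ "$(basename "$f")" = "$(basename "$g")" ]; then
            found=yes
            break
        fi
    done
    [ "$found" = "no" ] && cp -- "$f" /dst/
done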

Now, to answer your specific question: Will it be faster if you split the directory? You'll just have to test that. It depends on what file system you are using, on the size of the files themselves, on how rsync iterates over files.

And a general comment, perhaps not applicable here: Often file systems get abused for a huge number of small (less than a disk block or memory page) and fixed-size objects. In such cases, a database is likely to be a much more efficient solution. Similarly, sometimes the workflow (for example using the map/reduce paradigm) creates lots of tiny files in intermediate stages. In that case, merging them into fewer large files is usually much more efficient.
 
What kills me with directories that have too many files is when I chdir into one in a shell that supports tab completion, or use a command with a wildcard. It gets very slow.
 
Been there, inherited that...

rsync(1) uses ssh(1) as its transport agent, which has encryption enabled by default. The encryption slows things down. Often this won't be noticed if the transfer goes over a relatively slow network, but yours is local -- and the slow-down will be significant.

ssh(1) recently got "-c none" reinstated, to disable encryption. It may be worth hunting for a new version of ssh(1) that supports it.

VladiBG's suggestion of dump(8) makes sense in that context. Though FreeBSD dump(8) only applies to file systems (Solaris dump could be used on directories).

[The fix I eventually applied was to use a cryptographic hash of the file name to formulate a path name in which to store each file, pushing the leaf nodes of the file system down two levels. The retrieval "key" for the file (its file name) was thus unchanged (it just had to be hashed to derive the parent and grandparent directory names).]
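A minimal sketch of that scheme, with a hypothetical base directory and file name; on FreeBSD, sha256(1) prints just the digest when reading standard input (elsewhere you may need sha256sum and cut the first field):
Code:
#!/bin/sh
# derive grandparent/parent directories from a hash of the file name,
# so the file name itself stays the retrieval key
name="my_file.jpg"
hash=$(printf '%s' "$name" | sha256)
gp=$(printf '%s' "$hash" | cut -c1-2)   # grandparent directory
p=$(printf '%s' "$hash" | cut -c3-4)    # parent directory
mkdir -p "/data/media/$gp/$p"
mv -- "$name" "/data/media/$gp/$p/$name"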
 
I use rsync to create backups of a directory containing 5,000,000 files. Would it be faster if I split the files into 100 directories (50,000 files per directory)? I want to run only one rsync to sync everything, not one per directory.
I would suggest using one of these:

1. Use inotifywait from inotify-tools package with rsync like that:

Code:
% pkg which $( which inotifywait )
/usr/local/bin/inotifywait was installed by package inotify-tools-3.21.9.6

% inotifywait -r -m -e close_write --format '%w%f' /tmp \
    | while read MODFILE
      do
        # rsync "${MODFILE}" "${SOMEWHERE}"
      done

In other words - run rsync for each file that has been modified.



2. Use lsyncd with rsync for continuous replication:

- https://github.com/lsyncd/lsyncd



3. Use syncthing for continuous replication.
 
I've not had any issues with tens or hundreds of thousands of files or more under UFS, but it does get a bit unwieldy.

Not directly relevant here, but worth noting that there's a 32K sub-directory limit on UFS that has bitten me.

So I've tried something like:

/blah/images/<product-id>/

So:

/blah/images/1/
/blah/images/2/

... and so on and on

And that works fine until you have around 32,768 or more products ...
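One hedged workaround, sketched with a made-up product id: add an intermediate level derived from the id (e.g. the id modulo 1000), so no single directory gets anywhere near the 32K limit.
Code:
#!/bin/sh
# spread product directories over 1000 buckets
id=48213                      # hypothetical product id
bucket=$(( id % 1000 ))
mkdir -p "/blah/images/$bucket/$id"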
 
Out of interest, how would git handle something like this?
Git stores content in a (basic) database, so I don't think it would be inefficient. It should also only check for new, deleted, or changed files, which can reduce some processing.
 
You know that you can do replication + snapshots and then send these snapshots further away to another computer/medium?

No, I didn't know that.
In my terminology I understand replication as a continuous process of transferring data - using synchronous, semi-synchronous or asynchronous replication - between a source and a target destination. When you modify the data on the source, the modification is written to the destination. In some cases that may not be desired: for example, if the data is corrupted on the source, the corruption is written to the destination as well.

A backup, on the other hand, is an archived copy of the information to a file, storage device or pipe, which can then be handled and stored remotely, representing a single point in time. For an offline file system it's done by copying the information; for a live file system it's done via a temporary snapshot which is deleted after the transfer. That's why I suggest using dump(8) on the live filesystem: it takes a UFS snapshot (note that UFS journaling must be disabled for a live backup, using tunefs -j disable / in single-user mode) and can then pipe the backup to gzip and send it over SSH for remote storage if desired. It looks like this:

/sbin/dump -C16 -b64 -0Lau -h0 -f - / | gzip | ssh -p 22 user1@backup.example.com dd of=/home/user1/1402022.dump.gz

The benefit of this is that you can have differential backups using levels 1-9, which can be scripted for daily, weekly or monthly differential backups, or use level 0 for a full backup. Excluding files or directories from the backup is done via the file system flag nodump, which is useful for information that can be restored from another source. For example, you can exclude the ports tree, as it can be fetched from the internet: chflags nodump /usr/ports
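For example, a (hypothetical) daily differential run would just swap the level in the same pipeline; level 1 only captures what changed since the last lower-level dump:

/sbin/dump -C16 -b64 -1Lau -h0 -f - / | gzip | ssh -p 22 user1@backup.example.com dd of=/home/user1/$(date +%Y%m%d).level1.dump.gz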

Restoring a full backup is done with the restore(8) command. For gzip it looks like this:
note: this will overwrite the current information in / (root)
zcat /media/root.dump.gz | restore -rvf -

For an interactive restore:
zcat /media/root.dump.gz | restore -ivf -
 