organizing files

trutlze · Mar 4, 2012

Hi,

the common way of organizing files is sorting them into file trees. Sorting files into proper folders helps in finding them later. To me it seems that as the number of files increases, the effort of sorting and finding them in folders increases rapidly. The common way to solve this problem is to use more or less sophisticated search tools. But I think it would be better to get down to the root of the trouble. If files are well organized you don't need search tools. So I asked myself if there is another way of sorting files. I think that adding metadata to files may help in finding them easily later. Several file types support metadata (e.g. mp3, jpg). But I think there has to be a file-type-independent way of doing so.

I found mainly two solutions for this problem:

1. Using a database (so-called collection-managers)
   1.1 file-type-dependent (e.g. iTunes, Amarok, Griffith)
   1.2 mainly file-type-independent (e.g. Tellico, GCstar, Data Crow)
2. Using a specialized file-system (e.g. tagsistant, tagfs)

To me the solutions of category 1 are complicated: you need to install software with many dependencies (X, graphic-libs, ...). The solutions of 1.1 seem to be a no-go: one program for your ebooks, one for your mp3s, and so on.

I'm searching for a clean, efficient "UNIX-way". To me the solutions of category 2 seem to be the best.

My questions:

Is there a more efficient way of sorting files in file trees (e.g. a file/folder naming strategy)?
How do you keep track of your files?
Is there another common way of sorting files (that I didn't mention) than sorting them in file trees?
If tagging files is the best option for adding metadata to files, is there a proven FreeBSD way?

Morte · Mar 4, 2012

I think usage is probably the first consideration. Most of what computers are used for falls into a handful of categories. You have documents, pictures, movies, music, and maybe other projects (programming/design/etc). I think that's the first step in organizing.

One method would be putting everything for one type together. Like all Star Trek movies, pictures, documents together, but I think that doesn't organize things according to how they are used. For instance I may want to watch a ST movie, but how likely is it that I'd want to watch a movie AND look at pictures from it? I personally never access things that way.

I hear what you're saying about "tagging" and I gave up on that long ago. I have a lot of music coming from all over the place, and making sure the tags are consistent is way too much work. Another problem is that much of my music isn't in English and I wouldn't even know how to search for songs in Russian and Japanese. So my music strategy was:

/music/genre/artist/##-album/files

I don't listen to polka and death metal at the same time, so genre naturally reflects what I'm in the mood to listen to. Artist cuts that into a more specific category. I prefix albums with a number for the order (by year would also work). In the end I mainly try to keep the file names meaningful and semi-accurate, but even if I don't, the bucket is close enough to get the general idea. Similar to having files organized in the first place, naming them well is one tried and true method that will always work for you no matter what OS you use. That might mean you have to drag and drop files from a file manager into the program if it doesn't allow you to browse music easily that way (one of the reasons I'm not an iTunes fan).

trutlze · Mar 4, 2012

Your organization concept is pretty much the same as mine. But it fails in some situations:

Let's say you want to collect things on a specific topic (e.g. FreeBSD), and you have collected texts, videos, audio files, et al. Using our organization concept these files are spread over different folders and can't be found "instantly".
Let's say you listened to some music or watched a movie and you think it might be interesting for your boy-/girlfriend, and you want to show it to him/her the next time you meet or s/he has time. Maybe this will be in one or two weeks, and meanwhile you find some other stuff that might interest him/her. Using our organization concept we would probably create a folder and copy or link this stuff into it, producing redundancies. Or we would create a text file and paste paths into it, swearing like a trooper once we've changed some path and forgotten to change it in the text file too.

I think that in above situations (and there might be others) a system that I described above would be an advantage.

ctengel · Mar 5, 2012

Unfortunately I can't give much practical advice, but it's something that I've thought about for a while. When I was first thinking about it years ago (I think this was before I had any concept of "tagging" things or databases), I called what I wanted "multiple inheritance" (i.e. a given file or directory could belong to multiple directories). Now this certainly isn't as polished or organized as a database/table based method, but it is possible without anything special, like bulky graphical applications or a new filesystem.

You could use symlinks to show the additional inheritances of things (or even hardlinks I guess?). In your example, you'd do:

Code:

$ mkdir stuff_to_show_gf
$ cd stuff_to_show_gf
$ ln -s ../movies/comedy/a_movie

Now obviously this does have its issues with manageability and trying to get a bird's eye view. But it does have the advantages of working with all filetypes and being readable by all applications, which seem to be advantages shared only with filesystem solutions. If I were to seriously do this, I'd probably write scripts to manage it, so I could do something like:

Code:

$ tag genre=scifi star_wars.mpeg 2001_space_odyssey.avi
$ ls by_genre/scifi
star_wars.mpeg 2001_space_odyssey.avi

So I guess the idea is I'd keep all the files organized by where they came from effectively or how they got on my system (or in some cases just where they actually are (if I have a system with distinct hard drives, etc)), and then in one place have some directories like "by_genre" or "by_artist" and in there have anything "tagged" as such. All based on symlinks.
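
A rough sketch of how such a `tag` script could work, assuming a POSIX shell. The `TAGROOT` variable, its default location and the `by_KEY/VALUE` layout are made up for illustration; nothing like this exists on my system yet.

```sh
# Hypothetical "tag" helper: tag KEY=VALUE FILE...
# Creates $TAGROOT/by_KEY/VALUE/ and symlinks each file into it.
# TAGROOT and the directory layout are assumptions for illustration.
TAGROOT=${TAGROOT:-$HOME/tags}

tag() {
    pair=$1; shift
    key=${pair%%=*}      # text before the first '='
    value=${pair#*=}     # text after the first '='
    dir=$TAGROOT/by_$key/$value
    mkdir -p "$dir" || return 1
    for f in "$@"; do
        # Use an absolute target so the link works from any directory.
        d=$(cd "$(dirname "$f")" && pwd) || return 1
        ln -sf "$d/$(basename "$f")" "$dir/$(basename "$f")"
    done
}
```

With that, `tag genre=scifi star_wars.mpeg` followed by `ls "$TAGROOT/by_genre/scifi"` would list the link. Two files with the same name under one tag would overwrite each other, which a real script would have to handle.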

If I ever implement it I'll let you know...

fluca1978 · Mar 5, 2012

I have developed applications that either used a filesystem approach or a database approach. The former was organized via a set of links: files were organized in directories using a tree as main category, and then a set of links to provide different views were placed. In particular I had a set of scripts to create and recreate the links, since the links could be accidentally removed or be broken or whatever.
The second type of application was relying on a database to find the right file depending on the user search criteria. Of course, this is the best solution, since the database can become as complex as user needs, but will point always to the right file.
I guess both solutions are valid, it depends on the usage of your file tree. And of course, you can develop derivatives from both solutions.
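
As an illustration of the link-maintenance part, here is a minimal sketch (not the scripts I actually used) that removes links whose target has disappeared. The function name `prune_dangling` is made up; it only assumes a POSIX `find` and `test`.

```sh
# prune_dangling DIR: delete symlinks under DIR whose target no longer exists.
prune_dangling() {
    # "test -e" follows the link, so it fails exactly for dangling links;
    # those are printed and then removed.
    find "$1" -type l ! -exec test -e {} \; -print -exec rm -- {} \;
}
```

Running it over the directory that holds the views before recreating the links keeps stale entries from piling up.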

ctengel · Mar 6, 2012

@fluca: Would you be willing to share the scripts you used to maintain the set of links in the first solution? It sounds a lot like the idea I never implemented and I'd be interested to see. The DB one would be interesting too!

fluca1978 · Mar 6, 2012

ctengel said:
@fluca: Would you be willing to share the scripts you used to maintain the set of links in the first solution? It sounds a lot like the idea I never implemented and I'd be interested to see. The DB one would be interesting too!

Well, it is nothing so special. Here's a simplified version:

Code:

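customerID=$1
orderID=$2
WHEN=$(date '+%F')      # today's date, YYYY-MM-DD

cd /archive/pdf || exit 1

# New (or deleted) customer: create the per-customer directory.
if [ ! -d "/archive/pdf/conf-ord/byCustomer/$customerID" ]
then
   mkdir -p "/archive/pdf/conf-ord/byCustomer/$customerID" >/dev/null 2>&1
   # <generate order file here>
fi

# Link the generated file (named by user and date) into the per-customer view.
/bin/ln -s "/archive/pdf/conf-ord/byUser/conf-$USER-$WHEN-$customerID.pdf" "/archive/pdf/conf-ord/byCustomer/$customerID/Order-${orderID}.pdf" >/dev/null 2>&1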

I provide the script with the customer id and an order id. If the customer id is not yet on the disk (either new or deleted), a directory for the default archiving (by customer id) is created and the pdf file is generated (here you need your own business logic). Then a link is created in that directory, under another name, pointing to the generated file (which is named by the user who generated it and when). Each time this script is launched, the customer order is archived.

trutlze · Mar 6, 2012

@fluca:
Your example is nice, but I think it's a limited approach to solve my problem. In fact one can say that you used two tags for your files. In my vision "unlimited" tags should be possible.

@ctengel
That's why I think a linking approach wouldn't be the right thing. The effort of creating, modifying or deleting links seems to be extraordinary.

So maybe the database approach is the right one? What about using text databases (which can be edited with vi in exceptional cases) instead of real databases to limit the software dependencies?
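
To make the text-database idea concrete, here is a minimal sketch under some assumptions: a single file (`TAGDB`, a made-up name and location) with one "tag<TAB>path" pair per line, and tags that contain no whitespace. It only needs sh, sort and awk from the base system.

```sh
# Hypothetical plain-text tag database: one "tag<TAB>path" pair per line.
# TAGDB is an assumed location; the file can be edited by hand with vi.
TAGDB=${TAGDB:-$HOME/.tagdb}

# tag_add TAG FILE...: record each file under TAG (duplicates are dropped).
tag_add() {
    t=$1; shift
    for f in "$@"; do
        printf '%s\t%s\n' "$t" "$f"
    done >> "$TAGDB"
    sort -u -o "$TAGDB" "$TAGDB"
}

# tag_find TAG...: print the paths that carry every one of the given tags.
tag_find() {
    awk -F '\t' -v want="$*" '
        BEGIN { n = split(want, tags, " ")
                for (i = 1; i <= n; i++) need[tags[i]] = 1 }
        ($1 in need) { hits[$2]++ }
        END { for (p in hits) if (hits[p] == n) print p }
    ' "$TAGDB" | sort
}
```

Any number of tags per file is possible this way, e.g. `tag_add freebsd handbook.pdf` and later `tag_find freebsd video`. Renaming or moving a file still means editing the paths in the database, so that weakness remains.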

drhowarddrfine · Mar 6, 2012

And then you change your mind.

For some reason this reminds me of a comedy album from years ago. Reporter is interviewing a suicide photographer. "Take picture. Give to girlfriend. She cries her heart out."

"There are 106 ways to kill yourself."
"106?! Wow. Off-hand I can think of hanging...."
"Hanging! 107!"

trutlze · Mar 6, 2012

What's strange in changing one's mind especially if the task is to explore the "best" way? Or am I missing the target of your statement?

drhowarddrfine · Mar 6, 2012

There's nothing strange about changing your mind. I was reading about all the organizational efforts and was thinking to myself how some of those are good but it might all get thrown out if you changed your mind about one category.

phoenix · Mar 7, 2012

Sounds to me like a re-implementation/re-imagination of KDE SC 4.x's Nepomuk.

Pushrod · Mar 7, 2012

My files are organized by what they are:

tv shows
movies
short clips
documentaries
binary files (apps or plugins, fonts, etc)
games
documents (pdf etc)

Never go more than about 3 levels deep.

Also, I have an "incoming" directory where everything lives until it is consumed once. Then, when I consider it "used", I either nuke it, or place it in the hierarchy where it belongs. Having this setup means that recent stuff, which is the most interesting by far, is in one big, monolithic place which is easily searched.

mix_room · Mar 7, 2012

This is personally still one of my peeves with modern computers. We have technology that can let me find almost anything I want, and a whole bunch of things I don't want to find, on the internet at the touch of a few keys. BUT I am still expected to know where I put the letter I wrote to my grandmother by manually sorting it.

Searching, indexing and organizing are all great jobs for a computer to do. On a desktop they are generally static, so the task shouldn't actually be all that difficult. And yet I have not been able to find an operating system which has a very good search functionality built in.

I would love to be able to just dump all my files into 'My Documents' or other unsorted storage '/home/$USER/' and then have the computer index it. It (the computer) can even sort it (my files and data) in any way suitable to save storage. This micro-managing of storage should not be a task that people should be performing today.

On this note I am stuck with categorizing my files into a tree-subtree-subsubtree type structure. $DOCUMENTS/{$LETTERS,$PICTURES, ...}

ctengel · Mar 7, 2012

@mix_room:
In terms of indexing text-based things, my limited experience with "Spotlight" on Mac OS X was actually quite good. Required absolutely no configuration on my part, and was very fast to find things like letters.
Honestly I haven't tried to set something up quite like that on any other OS besides that, but I would think there is a tool out there for Unix-like OS's that can easily index text-based files.

However, I would imagine that "Spotlight" wouldn't work as well for me manually tagging video and so forth I got from all over; so now back to the question at hand regarding tagging with symlinks or a database or whatnot.

@trutlze:
The way I see it, a database is probably best suited for storing/managing the metadata, and it better encompasses the sort of relationships we are trying to set up here. (Not sure about text-based...what I am envisioning is more RDBMS, but I guess a set of CSV files WOULD work, with proper locking precautions in place.)
However, the symlink option has the advantage of being "understood" (to some extent) by any application that uses the filesystem.

Personally, I see a few options to implement the thing that I think both of us are trying for (of course understanding that yes, as some pointed out, we could just end up changing our minds!):
1. Symlink management scripts, probably something more complex than fluca's implementation (nothing wrong with it; excellent at what it does really, but we're looking for more than two tags and so forth)
2. Database option. Files stay laid out in filesystem (or I guess you could put smaller files in the DB itself!); you'd probably want some sort of management tool. (Strictly speaking not necessary...you can edit CSVs or run SQL commands all you want, but I imagine this would be even worse than managing symlinks by hand.) Depending on your needs, this may be enough; you search for something with certain tags, and it spits out what files those are, maybe generates a playlist! There should also probably be some kind of API and you could hook any apps you like to use (if they have APIs) into there. A lot of work though.
3. The issue with #2 for me though is that it requires use of an additional tool every time I want to look at something and doesn't integrate well into existing apps (or if it does, you have to code for each one). A "compromise" I think would be a database-based storage of metadata info, and then upon each update or adding of new files (maybe triggered by placing file into an "incoming" dir; I haven't addressed adding new media much yet, just viewing), etc, the change/update in the DB would cause a "resync" of symlinks (auto-generated from the master database). In this way it would allow kind of "hybrid"/"best of both worlds."
4. This is probably going way out there, but I was also thinking a database-driven approach, but instead of auto-populating a tree of symlinks like in #3, a FUSE driver or something like that which would provide seemingly direct access.
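
To give an idea of what the "resync" in #3 might look like, here is a minimal sketch. It assumes the master database is just a text file with one "tag<TAB>absolute-path" pair per line (a stand-in for whatever DB is really used); the function name `resync` and that format are made up for illustration.

```sh
# resync DB VIEWROOT: rebuild a tree of symlinks from a text database.
# DB holds one "tag<TAB>absolute-path" pair per line (an assumed format).
resync() {
    db=$1
    root=$2
    mkdir -p "$root" || return 1
    # Drop every old link first so removed tags disappear from the views.
    find "$root" -type l -exec rm -- {} +
    tab=$(printf '\t')
    while IFS=$tab read -r t path; do
        [ -e "$path" ] || continue      # skip files that have gone missing
        mkdir -p "$root/$t"
        ln -s "$path" "$root/$t/$(basename "$path")"
    done < "$db"
}
```

This throws away and recreates everything on each run, which is simple but slow for big collections, and it leaves empty tag directories behind. Two files with the same name under one tag would collide.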

Maybe I'm overthinking this, but it is a real issue I think. I am not sure, but I am guessing that somebody somewhere has already maybe at least partially addressed this. Are there any free/open solutions available for Unix-like OS's that do anything like this?

trutlze · Mar 7, 2012

@mix_room
I think that indexing files to search for them is a result of not being able to find files. In my opinion, with a proper file-ordering system you don't need to search for files, because you already know where they reside. By the way, locate(1) or find(1) fit my need in most cases (i.e. for files containing text).
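
For the text case, a minimal sketch of what I mean, using only find and grep from the base system (the function name `find_text` is made up):

```sh
# find_text DIR PATTERN: list files under DIR whose contents match PATTERN.
find_text() {
    find "$1" -type f -exec grep -l -- "$2" {} +
}
```

For example, `find_text "$HOME" grandmother` would list every file below the home directory that mentions the word. It reads every file each time, so it is no replacement for a real index.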

@ctengel
I think that going from the link approach to the filesystem approach, the implementation effort grows, but so does how closely the result matches what we wanted. The most complex option (a filesystem) meets the need of being understandable by any program (like the link approach). It would create those "links" virtually, using an internal "database" like any filesystem does (cf. inodes, see Wikipedia). It wouldn't need any additional software that's not included in the base system.

So far I think that a tag filesystem is the best solution. In my opinion this should be a system-wide filesystem (a kernel module?, written in C, performing well) and not a FUSE-thingy (like "Tag based file system using FUSE"). And as far as I know nobody has implemented it for FreeBSD.

Sadly I'm not able to implement this idea at the moment as I'm not that familiar with the system internals. Hopefully there is someone else?!

Or is there another impressive solution?

ctengel · Mar 8, 2012

I guess I like the idea of the TaggedFS, but some of its shortcomings would be prohibitive to me; most notable seems to be the requirement that each filename is unique. Another is the inability to remove tags.

What I do like about it though, and I would do also, is to leave the actual storage of the files to another filesystem, or even set of filesystems! (So I can tag things maybe on my local system, external drives, NFS shares, etc, all together.) Also, I've had issues in the past with silent data corruption (ReiserFS on Linux; not the filesystem's fault, but the hardware; it trusts that whatever is written to disk will be read back the same!), hence my usage of ZFS. I don't think I'd trust myself to do something like that.

You mention not doing FUSE, but I guess I'm a bit biased towards FUSE. I've used several FUSE-based drivers on Linux with great success and no noticeable performance issues. I'm not sure what the situation with that is in FreeBSD, but the other big reason why, if I was going to write my own implementation of this, I'd do it with FUSE, is that I'd be able to port it more easily. (Not to mention I don't know the FreeBSD kernel at all, and my Linux kernel knowledge is fairly limited. With all that being said though, I have no issue with using a FreeBSD module, if someone else wrote it!) I also would actually probably not need to use a system-wide form of tagging; my main usage would be media and documents. Not that I'd never want to tag other things, but I'm good with confining it.

Another idea I just had which may sort of completely head off the performance issues would be a combo of #3 and #4. So it's a distinct filesystem (like #4, whether FUSE or module), but rather than present actual files present symlinks. (like #3, but different/better in that we're not constantly populating/resyncing symlinks on another filesystem) The advantage of this is that most file read/writes would NOT have to be funneled through this new filesystem driver, FUSE, etc; and also it allows the "storing filesystem(s)" to handle permissions, locking, and so forth.

I like your comparison with inodes; I never would have thought of that. I'm still not exactly sure on how I'd actually implement the internal database though (flat files, SQL/RDBMS, etc). It is interesting to note that we are generally moving away from the usual Unix way of doing things...we're setting up another database in addition to the inodes of the storing filesystem. But I guess we sort of have to just to solve the problem we have, which is basically the inadequacy of that traditional directory model.

Makes me think the ideal would be to extend existing filesystem drivers to support this internally, but the problems with that are numerous:
-Data integrity; like I said before I don't like messing with it!
-Inability to mount the filesystems just to get data off on any system without "extended" drivers.
-Each filesystem type has to be done individually
-Can't combine local/remote stores into one database

So the total solution I'm leaning towards (and actually seriously toying with implementing) would be:
-FUSE based, providing directories of symlinks wherever it is mounted
-Have actual data stored in traditional filesystems of arbitrary type, possibly multiple; you could I guess even just say "/" for system-wide; in any case the directories of stored data would be specified in config/db, and symlinks would go out to whichever are available
-Database would use some sort of hash as key, keep track of where the file really is (which directory/filesystem that is under the realm of it, and then path to the file), as well as any sort of tags
-When someone goes to "ls" something like Genre/Comedy, it would query the db for all such files, and then come back with symlinks, each pointing to the real files
-Like I said, the actual structure/implementation of db itself is up in the air still
-The "datastores" would have to be "watched" (maybe at mount time, at regular intervals, and on-demand) for new stuff or anything moving. The hashes come in handy this way (If I decide to copy/move such and such a file from my network storage to local laptop so I can watch it on a plane, it would figure this out and I can still access the file the same way.)
-If there are multiple datastores, some sort of priority system would be good (If the file with that hash is available via network or locally, symlink to the local one)
-Whether to use symlinks or to pretend to actually have the file (and therefore pass-thru requests for it) I guess could be a mount-time option. I could see why in some cases you'd want "real" files.
-So far I've dealt only really with media, but once we get into documents it brings up a question of what to do when a file is edited.
-There would be some sort of management tools to do the actual tagging (this could be done in FS like with TaggedFS, but my tendency is that that would actually be more work. Would you rather "tag genre=scifi starwars.avi" or "ln -s starwars.avi /path/to/tagdbfs/genre/starwars.avi"?)
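
To show what I mean by using a hash as key and "watching" the datastores, here is a minimal sketch. It uses `cksum` only because it is part of POSIX and available everywhere; a real implementation would want a cryptographic hash such as SHA-256. The function names and the "key<TAB>path" index format are made up for illustration.

```sh
# index_store DIR: print "key<TAB>path" for every regular file under DIR.
# The key is the cksum CRC plus the file size, an assumption for this
# sketch; it is not collision-proof.
index_store() {
    find "$1" -type f -exec cksum {} + |
    while read -r sum size path; do
        printf '%s-%s\t%s\n' "$sum" "$size" "$path"
    done
}

# locate_by_key KEY INDEXFILE: print every known location of that content.
locate_by_key() {
    awk -F '\t' -v k="$1" '$1 == k { print $2 }' "$2"
}
```

Tags would then be attached to the key rather than to a path, so a copy on the laptop and the original on network storage resolve to the same entry. Indexing the preferred datastore first and taking the first line of `locate_by_key` would be one crude priority scheme.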

Anyway, if I have some spare time soon I'm thinking I may start reading up on the FUSE API.

fluca1978 · Mar 8, 2012

ctengel said:
1. Symlink management scripts, probably something more complex than fluca's implementation (nothing wrong with it; excellent at what it does really, but we're looking for more than two tags and so forth)

Of course you can place as many links as you want.

darkfeline · Jan 11, 2013

Hi. I'm new here, I've briefly read the forum rules. I apologize if necroing is against forum policy, but I'm wondering if this is still an issue and whether the following seems like a good solution for it: https://bbs.archlinux.org/viewtopic.php?pid=1215914 (disclaimer: this is my project). In short, it provides a tag-based general file organization system based on hard links.

The reason for this shameless plug is that I'm looking to see if there is a need for the project I'm working on. If no one needs it, then I won't be as motivated getting it into shape. If there IS a need for it, I'll be glad to work on it, and hopefully others can get some use out of it.

trutlze · Jan 12, 2013

Surely this topic is still an issue to me, even though it's not so pressing that I needed a solution the day before.

To me a solution of this problem wouldn't push new dependencies into the system (if possible); it would use whatever is there in the base system, which (in my opinion) is quite enough. So to me, implementing it in Python would be more like a workaround than a "real" solution. Thinking this way, even a Bourne shell script would be a better solution, as the interpreter is already included in the base system.

darkfeline · Jan 12, 2013

Hi trutlze

The solution my project proposes seems like a perfect fit to what you're saying. Maybe you've misunderstood/my descriptions aren't clear enough?

The organization used by hitagiFS is composed entirely of hard links and symlinks of the underlying file system. My Python scripts simply make it easier to work with. Think of them as bash scripts composed of "ln", "rm", "find" and (a LOT of) logic statements if you want (but coding in Python is much friendlier than bash).

Think of it this way: the only dependency is python, and hitagiFS is only really needed to make it easier to tag/untag (really just hard linking and unlinking files). After you've set up an organized system, you can uninstall hitagiFS and you can use the directory structure as is. (although I'd hope you find it useful enough to not justify freeing up a few MB from uninstalling)
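
To illustrate the general idea with nothing but `ln` and `rm`, here is a minimal sketch. This is not hitagiFS's actual code or directory layout; `TAGDIR` and the function names are made up for the example.

```sh
# Tagging with hard links: each tag is a directory, and tagging a file
# means creating another name for the same inode inside that directory.
# TAGDIR is a hypothetical location and must be on the same filesystem
# as the files, because hard links cannot cross filesystems.
TAGDIR=${TAGDIR:-$HOME/tags}

hl_tag()   { mkdir -p "$TAGDIR/$1" && ln "$2" "$TAGDIR/$1/$(basename "$2")"; }
hl_untag() { rm -- "$TAGDIR/$1/$(basename "$2")"; }

# same_file A B: succeed if both names have the same inode number
# (only meaningful within one filesystem).
same_file() {
    [ "$(ls -i "$1" | awk '{print $1}')" = "$(ls -i "$2" | awk '{print $1}')" ]
}
```

After `hl_tag jazz song.ogg`, the file can be renamed or moved within the filesystem and the tagged name still reaches the same data, which is the main difference from symlinks.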

throAU · Jan 18, 2013

IMHO, I'm all for the search engine approach - i.e., improving search AI.

Why?

Because tagging is prone to human error, requires foresight on the part of the human creator, and humans are lazy/time poor.

And essentially, all you're doing by tagging is burning human time to create an index - boring labour intensive work. Being the lazy typical human I am, the first work I drop when I am busy is the boring stuff that only has the potential of some future pay-off down the road.

Computers are good at repetitive, boring work, such as automatically scanning and indexing. They're less prone to forget some tags that may apply, too.

When I go looking for things 6+ months down the track, it is quite likely I am looking for the content for different reasons than originally tagged...

Spotlight does a pretty good job on the Mac, but it isn't perfect.

darkfeline · Jan 20, 2013

I suppose different people have different needs. There are quite a few problems with search. First, it is complicated and prone to its own errors (when you do a Google search, how many of the hits are actually relevant to what you're looking for? And that's one of the best algorithms too). Second, it's not really applicable to non-text things (such as music or pictures).

Tagging can be automated to an extent, and can be organized per personal demands. The biggest drawback of tagging is the time factor. If you've already built a collection and need to organize retroactively, then it is a big pain in the behind, but if you tag files as you accumulate them, the time impact is much lower. It's the same thing as "put things back when you're done using them, and you don't have to spend a week cleaning up the mess later".
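
As a trivial example of automated tagging, here is a sketch that guesses a tag from the file name alone. The extension-to-tag table and the function name `auto_tags` are made up; a real tool could also look at embedded metadata or the source directory.

```sh
# auto_tags FILE: print a tag guessed from the file extension.
# The table below is only an example; matching is case-sensitive.
auto_tags() {
    case $1 in
        *.mp3|*.ogg|*.flac)  echo audio ;;
        *.avi|*.mpeg|*.mkv)  echo video ;;
        *.jpg|*.png)         echo picture ;;
        *.pdf|*.txt)         echo document ;;
        *)                   echo untagged ;;
    esac
}
```

Something like this could run on every new file as it arrives, leaving only the more personal tags to be added by hand.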

P.S. I think you're confusing the issue. Tagging can be automated. But if you ask me to choose between automated tagging vs AI search algorithm (e.g. it detects when you search certain terms and choose certain results, and learns and changes various internal scores), I think tagging is much more suitable for PC use. Imagine searching "weekly report" and having some reports not show up because they've fallen off the AI's relevance score!