How do you people manage/organize/index downloaded PDF & other type files?

Not specific to FreeBSD but surely some of you have run into similar problems?

I download a lot of PDFs but never got around to organizing them properly. Also, it is not just pdfs but epubs, tex files, text files, webpages, directories, images, videos etc. Is zotero the best tool for this? Any other useful tools? What I want ideally:
  • watch certain directories for new pdf (& other type) files
  • extract metadata & text, the latter for text-based search
  • keep the original URLs
  • show them in an "inbox", with suggestions as to where to file
  • use metadata to name the *symlink* (but not the original pdf)
  • filing via symlinks
  • originals may optionally get moved to a more permanent place
  • allow adding tags & notes
  • access from any unix machine
  • cli / web / gui interfaces, but prefer more modular cli approach
  • learn from prior actions
I think zotero does a lot of this.
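
For the "name the symlink from metadata, leave the original alone" items in the list above, here is a minimal Python sketch. The function names (slugify, file_by_symlink), the author-title naming scheme and the directory layout are all my own assumptions, not something Zotero or any other tool does; where the title and author come from (PDF metadata, manual entry) is left out.

```python
import os
import re
from pathlib import Path


def slugify(text: str) -> str:
    """Lowercase the text and replace runs of non-alphanumerics with '-'."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def file_by_symlink(original: Path, title: str, author: str, category_dir: Path) -> Path:
    """Create category_dir/<author>-<title><ext> as a symlink to `original`.

    The original file is neither renamed nor moved (hypothetical scheme).
    """
    category_dir.mkdir(parents=True, exist_ok=True)
    link = category_dir / f"{slugify(author)}-{slugify(title)}{original.suffix.lower()}"
    if link.is_symlink() or link.exists():
        # Refuse to overwrite an existing entry; the caller decides what to do.
        raise FileExistsError(link)
    os.symlink(original.resolve(), link)
    return link
```

Since the link stores the absolute path of the original, moving the original later to "a more permanent place" would mean recreating the links.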
 
Nobody really knows.

I create a title for the pdf that best identifies it if the title doesn't already do that. Then I file it in a directory where I think it best belongs; such as Drive/FreeBSD/

Later, if I'm looking for a particular subject, I'll try and find it using ls. If that doesn't work then I'll use find. If still no luck then grep

Everything should be as simple as possible.
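
The ls, then find, then grep escalation can be sketched in Python as a single lookup that tries file names first and falls back to file contents. This is only an illustration with made-up function names; like plain grep, the content search sees raw bytes, so it will miss text inside compressed PDF streams.

```python
from pathlib import Path


def find_by_name(root: Path, term: str) -> list[Path]:
    """Like ls/find: files under root whose name contains term (case-insensitive)."""
    term = term.lower()
    return sorted(p for p in root.rglob("*") if p.is_file() and term in p.name.lower())


def find_by_content(root: Path, term: str) -> list[Path]:
    """Like grep -ril: files under root whose raw bytes contain term."""
    needle = term.lower().encode()
    hits = []
    for p in sorted(root.rglob("*")):
        if not p.is_file():
            continue
        try:
            data = p.read_bytes()
        except OSError:
            continue  # unreadable file, skip it
        if needle in data.lower():
            hits.append(p)
    return hits


def locate(root: Path, term: str) -> list[Path]:
    """Try the cheap name search first; only scan contents if it finds nothing."""
    return find_by_name(root, term) or find_by_content(root, term)
```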
 
Later, if I'm looking for a particular subject, I'll try and find it using ls. If that doesn't work then I'll use find. If still no luck then grep
I rely on my memory and macOS Spotlight, but both are far from perfect and getting worse! Also, with 7k-8k+ PDFs (some duplicates) it is a pain. These have been collected over decades, and I haven't used a consistent naming scheme over that period. I quickly scan downloaded tech reports but usually don't read them cover to cover, yet I may want to read or refer to them even years later.
 
Requires the Google ecosystem, but anyway:

notebooklm.google.com is a nice toy for pdfs. It is an LLM that you can throw a few documents into, and it will answer from those documents: questions, summaries etc. I find it to be quite good as long as the document doesn't rely too much on pictures.
 
None... and I have a total mess in those. I mean, I have kept them under e.books/ and e-learning/ for 20 years or so (yeah, dot vs hyphen mistake too).
But I still end up googling stuff when I need something, even though I do have those books in my repository.

I'm considering just deleting it all; it's unlikely I will ever use them again. But for now I keep syncing them around when I move data.
 
Is that not a CMS (content management system)?


I personally order my files in directories, nothing special.

Perhaps you can write your own program? I did something like that for my CDs that I backed up with cdda2wav.
I put the CD tracks in a directory hierarchy Music containing wav/mp3 files, a subdirectory may contain a file
Desc describing its contents. Then a program generates a sqlite3 database from the directory and the Desc files.
If you want to reach it from the internet, you will have to write a cgi script with authentication.
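
A rough Python sketch of that kind of generator, for anyone who wants to try the same idea. The table layout, the function names and the choice of indexing only .wav/.mp3 files are my guesses, not the actual program described above; it assumes each directory may hold a plain-text file called Desc.

```python
import os
import sqlite3
from pathlib import Path

MEDIA_SUFFIXES = {".wav", ".mp3"}  # assumption: only these files are indexed


def build_index(root: Path, db_path: str = ":memory:") -> sqlite3.Connection:
    """Walk root and record every directory (with its Desc text) and media file."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS dirs (path TEXT PRIMARY KEY, description TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, dir TEXT)")
    for dirpath, _dirnames, filenames in os.walk(root):
        rel_dir = Path(dirpath).relative_to(root).as_posix()
        desc_file = Path(dirpath) / "Desc"
        # A missing Desc file is stored as NULL.
        description = desc_file.read_text(encoding="utf-8").strip() if desc_file.is_file() else None
        conn.execute("INSERT OR REPLACE INTO dirs VALUES (?, ?)", (rel_dir, description))
        for name in filenames:
            if Path(name).suffix.lower() in MEDIA_SUFFIXES:
                rel_file = (Path(rel_dir) / name).as_posix()
                conn.execute("INSERT OR REPLACE INTO files VALUES (?, ?)", (rel_file, rel_dir))
    conn.commit()
    return conn


def search(conn: sqlite3.Connection, term: str) -> list[str]:
    """Return files whose directory description contains term."""
    rows = conn.execute(
        "SELECT f.path FROM files f JOIN dirs d ON f.dir = d.path "
        "WHERE d.description LIKE ? ORDER BY f.path",
        (f"%{term}%",),
    )
    return [row[0] for row in rows]
```

Rebuilding the database from the tree each time keeps the directory and the Desc files as the only source of truth; the database is just a disposable index.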
 
Just individual files organised by general subject in a directory tree. I like mupdf for viewing pdfs, it's very fast. I use vifm to give me a browser and viewer launcher. For a quick look at html I use 'links', also very fast. I do the same with mp4's etc. Vifm lets you associate a filename suffix with a specific launcher. You can do things like regex search on file names. It's fast and works well. I guess something like kde dolphin would be the gui equivalent, but I use windowmaker, I mainly use the terminal environment.

Yeah, it's easy for it to turn into a big mess, I don't spend a lot of time on it. In fact I usually use online search and read online information before I look at anything I've saved previously. Even if I remember I've saved something somewhere locally, it's often faster just to search for it online, having a fast broadband connection. Which is kind of scary, because you become dependent to some extent; and it makes you lazy. Perhaps it's the only realistic way of dealing with the information flood.
 
notebooklm.google.com is a nice toy for pdfs. It is an LLM that you can throw a few documents into, and it will answer from those documents.
I have heard good things about notebooklm. I don't (yet) want to ask an LLM to interpret the contents of a PDF, but maybe a local LLM can help with categorizing.

This seems promising as a component: https://github.com/NanoNets/docstrange
Convert and extract data from PDF, DOCX, images, and more into clean Markdown and structured JSON. Plus: Advanced table extraction, 100% local processing, and a built-in web UI.
 
Websites on a local server, if you stay at home or in the office like me. Create one website per topic and, inside each website, sections and sub-sections for the subtopics. On disk, make a directory tree that mirrors the names of the websites, sections and sub-sections. If you travel and need to keep documents on your working machine, copy the same directory tree.
 
As with others here, my ~/Downloads is mostly a mess. For books I want to keep and plan to read I use deskutils/calibre. It's great for organizing content, and you can even add extra metadata fields. Just set Calibre to use an external PDF reader of your choice; the internal PDF conversion can be messy.
 
Yeah, I bet a lot of us have a mess of stuff that we just never get to organizing. For what it's worth, I have a pdfs directory, which has subdirectories of novels and shellscripting (which contains any tech book, not just scripting), and the shellscripting subdirectory has an epub section. And that was the result of me saying, Wow, let me organize this, then getting tired after a while. I have a docxls directory, and $DEITY help me if I ever need anything out of it, though I do have some subdirectories there, such as 2025medexp and 2025medreceipts. But then, looking through that directory in order to put something here, I had to laugh: I have 2025lease.pdf and another called lease.pdf from some year or another. And yeah, those pdfs are in the docxls directory rather than the pdfs directory.

And when I try to get rid of old stuff, too often I get rid of something I'll need three months later, and go frantically through backup files, trying to remember the name.
 
For anything that's an e-book (or vaguely resembles one), magazines, papers, slides/supplementary documents for talks or courses, I have a folder in my www/nextcloud with somewhat of a hierarchy, and the PDFs are named after the title and author, so I can easily ls|grep them. The ebooks are also occasionally organized with deskutils/calibre, but since I haven't used an e-book reader for years, this effort has mostly stalled...

Things like manuals/specsheets etc., as well as all other documents (invoices, insurance policies etc. pp.), go into deskutils/py-paperless-ngx, where I can do full-text search of the contents. I also started feeding some papers and other contents of the former collection into it if I want to have them fully searchable, and it automagically ingests all documents I receive via mail. All tax-related stuff also goes in there, and I use it as a personal archive for any documents I have/want to preserve but don't want to clutter up lots of space; i.e. anything for which I'm not legally required to keep the original dead-tree version goes to the shredder after archiving. (The archive is backed up in multiple tiers, including sysutils/tarsnap.)

I never heard of zotero, but it seems if you have to organize a multitude of file/media types it might be helpful. For documents I'm pretty happy with paperless-ngx.
 
I wonder what the old masters who created that ancient library would make of our modern information systems.
We can go further back and see what the ancients said about writing!

The story goes that Thamus said much to Theuth, both for and against each art, which it would take too long to repeat. But when they came to writing, Theuth said: “O King, here is something that, once learned, will make the Egyptians wiser and will improve their memory; I have discovered a potion for memory and for wisdom.”

Thamus, however, replied: “O most expert Theuth, one man can give birth to the elements of an art, but only another can judge how they can benefit or harm those who will use them. And now, since you are the father of writing, your affection for it has made you describe its effects as the opposite of what they really are. In fact, it will introduce forgetfulness into the soul of those who learn it: they will not practice using their memory because they will put their trust in writing, which is external and depends on signs that belong to others, instead of trying to remember from the inside, completely on their own. You have not discovered a potion for remembering, but for reminding; you provide your students with the appearance of wisdom, not with its reality. Your invention will enable them to hear many things without being properly taught, and they will imagine that they have come to know much while for the most part they will know nothing. And they will be difficult to get along with, since they will merely appear to be wise instead of really being so.”
— from https://newlearningonline.com/liter...-on-the-forgetfulness-that-comes-with-writing

[Incidentally, the old Indian tradition was students directly learning from their teachers. The epics Ramayana and Mahabharata and indeed the Vedas were orally transmitted thus over thousands of years. Brahmi & Devnagari scripts came much later]
 
I prefix filenames with an rfc3339 date (today's, when in doubt), using a script.
It surprises me that apparently most people do not do that.
How I wish the e-mails from my bank did that!
Instead of it, I receive files with very long names, spaces and special symbols in between.
About indexing/tagging: that is what I do with my CD collection, as mentioned above, with a sqlite3 db.
 
You have not discovered a potion for remembering, but for reminding; you provide your students with the appearance of wisdom, not with its reality. Your invention will enable them to hear many things without being properly taught, and they will imagine that they have come to know much while for the most part they will know nothing. And they will be difficult to get along with, since they will merely appear to be wise instead of really being so.”
I guess the invention of writing itself was the original "information technology"! There was a strong oral tradition in my part of the world too... the Norse and Icelandic sagas, the Irish and British bards and druids, stories and legends memorised and passed down the generations without writing; a tradition of oral culture that probably stretches back to the Bronze Age, perhaps even further.

I remember reading about the Incas' 'quipu', which was a way of using knotted cords to store information... fascinating.

As an aside, I've thought for many years that the Unix userland's focus on strong text processing capabilities was a good choice. Sophisticated editors, the widespread use of regular expressions, special purpose text processing languages (sed, awk, etc), the typesetting tools.
 