Is there any product that can act as a knowledge base for topics (materials will include archived versions of websites, documents, media)?

@MigratingtoLemmy · 2 years ago

Is there any product that can act as a knowledge base for topics (materials will include archived versions of websites, documents, media)?

@vegetaaaaaaa · edit-2 2 years ago

organisation structure with plain directories

Slightly edited so you get the idea:

├── ARCHIVE
│   ├── DOCUMENTS
│   │   ├── 2018
│   │   ├── 2019
│   │   └── 2020
│   ├── WORK
│   │   ├── PROJECT1
│   │   └── PROJECT2
│   ├── DATA-MISC
│   ├── NOTES
│   ├── GAMES
│   ├── IMAGES
│   │   ├── MISC
│   │   ├── 2018
│   │   ├── 2019
│   │   └── 2020
│   ├── BOOKS
│   │   ├── IT
│   │   ├── DIY
│   │   └── NOVELS
│   ├── MUSIC
│   │   └── ARTIST - ALBUM
│   ├── SOFTWARE
│   │   ├── LINUX
│   │   ├── WINDOWS
│   │   └── ANDROID
│   └── VIDEO
│       ├── MOVIES
│       ├── MUSICVIDEOS
│       └── DOCUMENTARIES
├── DOWNLOADS
│   ├── DOCUMENTS
│   ├── WORK
│   ├── BOOKS
│   ├── GAMES
│   ├── MUSIC
│   ├── SOFTWARE
│   └── VIDEO
└── TMP

I use UPPERCASE for my base directory structure, so I know that an uppercase directory is probably part of the fixed structure. The key is to keep it 2-3 levels deep at most.
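
For anyone who wants to try this layout, here is a minimal Python sketch that creates an abbreviated version of the tree and applies the uppercase check. The names `STRUCTURE`, `create_tree` and `is_fixed` are made up for illustration and are not part of any existing tool.

```python
from pathlib import Path

# Abbreviated copy of the fixed structure shown in the tree (a few branches only).
STRUCTURE = {
    "ARCHIVE": {
        "DOCUMENTS": {},
        "BOOKS": {"IT": {}, "DIY": {}, "NOVELS": {}},
        "VIDEO": {"MOVIES": {}, "MUSICVIDEOS": {}, "DOCUMENTARIES": {}},
    },
    "DOWNLOADS": {"DOCUMENTS": {}, "BOOKS": {}, "VIDEO": {}},
    "TMP": {},
}


def create_tree(root: Path, structure: dict) -> list[Path]:
    """Create the directories under root and return them in creation order."""
    created = []
    for name, children in structure.items():
        path = root / name
        path.mkdir(parents=True, exist_ok=True)  # safe to re-run
        created.append(path)
        created.extend(create_tree(path, children))
    return created


def is_fixed(path: Path) -> bool:
    """Heuristic from the post: an uppercase name is probably part of the fixed structure.

    Names without letters (for example a year like 2018) return False here,
    because str.isupper() needs at least one cased character.
    """
    return path.name.isupper()
```

Calling `create_tree(Path.home() / "data", STRUCTURE)` would build the skeleton under a hypothetical `data` directory; running it again changes nothing because of `exist_ok=True`.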

Level 1:

ARCHIVE: stuff I want to keep, gets backed up
DOWNLOADS: stuff I have not had time to listen to/look at/process yet. Not backed up (but I do back up a list of the files in this hierarchy).
TMP: stuff I use regularly but does not deserve to be archived/backed up (working copies of projects, random scripts/programs, VM disks…). Temporary, expendable.
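
Backing up only a list of the files in the DOWNLOADS hierarchy could look roughly like the sketch below. The function names and the one-path-per-line manifest format are assumptions for illustration; the post does not say how the list is actually produced.

```python
from pathlib import Path


def list_files(root: Path) -> list[str]:
    """Return the relative paths of all regular files under root, sorted."""
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file()
    )


def write_manifest(root: Path, manifest: Path) -> int:
    """Write one relative path per line and return the number of files listed."""
    files = list_files(root)
    text = "\n".join(files) + ("\n" if files else "")
    manifest.write_text(text, encoding="utf-8")
    return len(files)
```

The manifest should be written somewhere that does get backed up (for example under ARCHIVE), so the list survives even if the downloaded files themselves are lost.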

Level 2: Broad topic or media type. Max 5-8.

Level 3: Finer-grained topic/media type. Only the ARCHIVE tree has this level of organization. There may be directories deeper than that, but I don’t actively manage them, they just… exist (extracted archives, etc.). The one exception is subdirectories named NOBACKUP, which are always excluded from automatic backups.
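
The NOBACKUP rule is easy to reproduce in a custom backup script. The sketch below is only an illustration of the idea, not the backup tool used here: it walks a tree and skips any directory named NOBACKUP, at whatever depth it appears. `files_to_back_up` is a hypothetical name.

```python
import os
from pathlib import Path
from typing import Iterator

EXCLUDED_NAME = "NOBACKUP"


def files_to_back_up(root: Path) -> Iterator[Path]:
    """Yield every file under root, skipping anything inside a NOBACKUP directory."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into excluded directories.
        dirnames[:] = [d for d in dirnames if d != EXCLUDED_NAME]
        for name in filenames:
            yield Path(dirpath) / name
```

Most backup programs offer an exclude pattern that achieves the same thing; the point is only that a fixed directory name makes the exclusion rule trivial to express.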

archivebox […] are you writing an alternative tool? Would you like to share the repo? I was considering just using wget to pull down pages but that might not work all the time.

I am working on this tool, which is a generic data manipulation/workflow tool. The shaarli workflow already works: it grabs bookmarks from Shaarli and downloads audio and video files. The webpage archiving module is not written yet; it is still at the early design stage (issue). It will probably use wget in the backend. The alternative would be running a full headless browser, and I don’t want to get into that. This is my first medium-sized Python project and I am trying to keep it clean, so it will take some time. Currently I’m more focused on other workflows/parts of the software.
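
To give an idea of what "wget in the backend" could mean, here is a rough sketch that wraps GNU wget from Python. It is not code from the tool, whose archiving module does not exist yet; the function names are hypothetical and the choice of flags is an assumption (a commonly used combination for saving a single page with its images and stylesheets).

```python
import subprocess
from pathlib import Path


def build_wget_command(url: str, dest: Path) -> list[str]:
    """Build a wget command line that saves one page plus the files it needs."""
    return [
        "wget",
        "--page-requisites",   # also fetch images, CSS and scripts used by the page
        "--convert-links",     # rewrite links so the local copy works offline
        "--adjust-extension",  # add .html/.css suffixes where they are missing
        "--span-hosts",        # allow requisites hosted on other domains
        f"--directory-prefix={dest}",
        url,
    ]


def archive_page(url: str, dest: Path) -> int:
    """Run wget and return its exit status (0 means no problems were reported)."""
    dest.mkdir(parents=True, exist_ok=True)
    return subprocess.run(build_wget_command(url, dest), check=False).returncode
```

As noted in the question, this will not work all the time: pages that build their content with JavaScript are saved without that content, which is the case a headless browser would cover.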

As for file organization inside the directories, I try to maintain consistent/useful file naming including (depending on directory) date in YYYYMMDD format, author/parties involved, subject.
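
A small helper can keep such names consistent. The sketch below assumes a " - " separator (borrowed from the ARTIST - ALBUM pattern in the tree) and a fixed date/author/subject order; both are illustrative choices, and `build_name` is a hypothetical function.

```python
from datetime import date


def build_name(day: date, author: str, subject: str, extension: str) -> str:
    """Build a file name like '20200314 - ACME - invoice.pdf'.

    Empty fields are skipped, since not every directory needs every field.
    """
    stamp = day.strftime("%Y%m%d")  # YYYYMMDD
    parts = [stamp, author.strip(), subject.strip()]
    stem = " - ".join(p for p in parts if p)
    return f"{stem}.{extension.lstrip('.')}"
```

Putting the date first in YYYYMMDD form has a practical benefit: a plain alphabetical sort of the file names is also a chronological sort.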