Some subreddits may never come back online. Some subreddits are also very valuable because of their old posts and responses. Alas, the intersection isn’t empty (I personally am anxious about r/suggestmeabook and r/TrueLit).
Naturally, one would like to download all posts and comments to offline storage. Just as naturally, the usual methods are useless when the subreddit is private.
Are there any good options for the pessimistic scenario? Scraping the web archive? Filtering ML datasets? Anything else?
Apparently old full dumps are available:
Downloading and filtering all of Reddit is terribly inconvenient, but at least it should give mostly complete data.
Looks like newer data may be available here: https://archive.org/details/archiveteam_reddit
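For the "download everything and filter" route, here is a minimal sketch of the filtering step. It assumes the dump has already been decompressed into newline-delimited JSON with one post or comment per line, and that each record carries a "subreddit" field; both are assumptions about the dump format, so check a few lines of your file first. The function name filter_subreddits and the script name in the usage note are made up for illustration.

```python
import json
import sys


def filter_subreddits(lines, wanted):
    """Yield the raw JSON lines whose "subreddit" field is in `wanted`.

    Assumption: every input line is one JSON object with a "subreddit"
    key. Lines that do not parse, or lack that key, are skipped.
    """
    wanted = {name.lower() for name in wanted}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip damaged or truncated lines
        if not isinstance(record, dict):
            continue
        # Compare case-insensitively so "TrueLit" matches "truelit".
        if str(record.get("subreddit", "")).lower() in wanted:
            yield line


# Run as a script: subreddit names come from the command line,
# records are read from stdin and matching ones are printed to stdout.
if __name__ == "__main__" and len(sys.argv) > 1:
    for kept in filter_subreddits(sys.stdin, sys.argv[1:]):
        print(kept)
```

Something like `zstd -dc dump.zst | python filter_subs.py suggestmeabook TrueLit > kept.ndjson` would stream the file without unpacking it to disk first (dump.zst is a placeholder name, and this assumes the dump is zstd-compressed). If zstd complains about the window size, its `--long=31` flag may help. Streaming line by line keeps memory use flat no matter how large the dump is.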
You could try Pushshift.
Is it going to work on a private subreddit?
Also, Pushshift is effectively dead as far as I can see.
I think they wouldn’t have posts that were private at the time of posting. Otherwise they store the posts, so those should be available even though the subreddit is private now. Also, the archives might be dead, but the data up to 2023 is available as a torrent here on Academic Torrents.