Hi everyone, I’m seeking advice and opinions. I’m building a web-based RSS reader/search engine/discovery tool. Like any RSS reader, my app fetches content from feeds and displays it to subscribers. Often, blog authors include only a short summary in the RSS feed, and the reader has to visit the blog's website for the full content.

My app also attempts to scrape the full webpage of each blog post for search indexing purposes (respecting `robots.txt`, of course). It also saves the HTML content for archiving purposes, much like the Internet Archive; if the author disallows the `ia_archiver` user agent, I honour that and don't archive (a rough sketch of that check is at the end of this post).

So, since the app may already store the full content, my dilemma is whether it's ok (ethical) to show the full article in my reader. This view is never public; only registered users who subscribe to the blog can see it. But it still feels wrong, because it's not even like a browser's "reader mode": the user never visits the original page at all.

Not ok because:
- Authors who only include a short summary in the RSS do so precisely because they want readers to visit their website.
- Visiting the original blog is a much more personal experience than reading every blog in the reader app's one shared UI; bloggers craft their digital gardens for visitors!
- Some blogs include styles, math, scripts, etc. which aren’t rendered correctly elsewhere after scraping.
Ok because:
- It’s a nicer UX for the reader?
Curious what others think.
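For reference, the archiving check mentioned above is essentially a `robots.txt` lookup for the relevant user agent. A minimal sketch of the idea using Python's standard library; the URL and user-agent strings are placeholders, not my actual implementation:

```python
# Sketch only: URLs and user-agent strings below are placeholders.
from urllib import robotparser
from urllib.parse import urljoin

def may_fetch(page_url: str, user_agent: str) -> bool:
    """Return True if the site's robots.txt allows this user agent to fetch the page."""
    rp = robotparser.RobotFileParser()
    rp.set_url(urljoin(page_url, "/robots.txt"))
    rp.read()  # download and parse robots.txt
    return rp.can_fetch(user_agent, page_url)

post_url = "https://example-blog.com/some-post"  # hypothetical blog post

if may_fetch(post_url, "MyReaderBot"):
    ...  # scrape the page for search indexing

if may_fetch(post_url, "ia_archiver"):
    ...  # archive the HTML; skipped when the author disallows ia_archiver
```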
I've made things that were maybe borderline ethical, and ultimately it came down to what I could live with when I looked in the mirror.
If the mirror is giving you a hard time, then build something to contact the website owners whose sites you are scraping: give them an option for profit sharing, if there is any (you would effectively be advertising for them), and give them an option to opt out. It should be easy to automate this. Just look for a contact or about page, and if there's a form, an email address, or a social profile, use that.
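A rough sketch of what that automation might look like, assuming Python with requests and BeautifulSoup; the URL is a placeholder, and you'd want rate limiting and error handling on top:

```python
# Sketch: find a contact email for a site you scrape, so you can reach out.
# Assumes requests and beautifulsoup4 are installed; the URL is a placeholder.
import re
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def find_contact_email(site_url: str):
    """Return the first mailto address found on the homepage or a contact/about page."""
    home = requests.get(site_url, timeout=10)
    soup = BeautifulSoup(home.text, "html.parser")

    # Candidate pages: the homepage plus anything linked as "contact" or "about".
    candidates = [site_url]
    for a in soup.find_all("a", href=True):
        label = (a.get_text() + " " + a["href"]).lower()
        if "contact" in label or "about" in label:
            candidates.append(urljoin(site_url, a["href"]))

    # Scan each candidate page for a mailto: link.
    for url in candidates:
        html = requests.get(url, timeout=10).text
        match = re.search(r"mailto:([\w.+-]+@[\w.-]+)", html)
        if match:
            return match.group(1)
    return None  # fall back to a contact form or social profile, or skip the site

print(find_contact_email("https://example-blog.com"))
```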
Chances are this will be some work with no real gain or loss, because most owners won't reply or won't even notice your attempts at contact. But the effort isn't wasted: you can advertise yourself as the ethical option, which gains you new users.
Then you can say you tried your best and maybe the mirror will be somewhat kinder.
The original content creator relies on advertising, click-throughs and maybe merchandise sales that you would be denying them by scraping their content. This is the entire argument against Google doing what they've been doing for the past decade. The value of Google, and by extension your RSS reader, is generated by other people's content; it has little inherent value on its own, because without content it is useless. Drain the income of content creators for long enough and you no longer have content creators, so now you need another way to generate content. Enter generative AI.
And thus was the internet of 2024 forged, through stolen content and seeing no value in the creations of people, only desiring more content at any cost, as long as that cost to the platform is zero.