Hi everyone,
I’m seeking advice and opinions.
I’m building a web-based RSS reader/search engine/discovery tool. Like any RSS reader, my app fetches content from feeds and displays it to subscribers. Often, blog authors include only a short summary to the RSS, and the user has to visit the blog website to read the full content.
My app also attempts to scrape the full webpage of the blog post for search indexing purposes (respecting robots.txt, of course). It also saves the HTML content for archiving purposes, like Internet Archive (if the author disallows ia_archiver user agent, I also honour that and don’t archive).
So, since the app might already store the full content, my dilemma now is whether it’s ok (ethical) to show the full article in my reader? This view is never public, so only registered users who subscribe to the blog can see it. But still, it feels wrong, because it’s not even like browser’s “reader mode” — the user does not visit the original page at all.
Not ok because:
- Authors who only include a short summary in the RSS do so precisely because they want readers to visit their website.
- Visiting the original blog is a much more personal experience than reading all blogs from the same UI of the reader app; bloggers craft their digital gardens for visitors!
- Some blogs include styles, math, scripts, etc. which aren’t rendered correctly elsewhere after scraping.
Ok because:
- It’s a nicer UX for the reader?
Curious what others think.
I made stuff that was maybe border line ethical, and I think ultimately it was up to what I could deal with, when I looked in the mirror.
If the mirror is giving you a hard time, then build something to contact the website owners who you are scraping, and give them an option for profit sharing, if any (you would be like an advertisement for them), and give them an option to opt-out. It should be easy to automate this. Just look for the contact or about page, and if a form or email or social site then use that.
Chances are, this will result in some work with no real gain, or loss, because most will not reply or not notice your attempts at contact. And anything lost can be used in your advertising yourself as ethical which gain new users.
Then you can say you tried your best and maybe the mirror will be somewhat kinder.