The original post: /r/usenet by /u/einhuman198 on 2025-01-23 18:38:32.
So Highwinds just hit 6000 days of retention a few days ago. When I saw this, my curiosity was sparked again, like several times before. Just how much data does Highwinds store to offer 6000+ days of Usenet retention?
This time I got motivated enough to calculate it based on existing public data, and I want to share my calculations. As a side note: my last uni math lessons were a few years ago, and while I passed, I can't guarantee the accuracy of my calculations. Consider the numbers very rough approximations, since they don't account for taken-down data, compression, deduplication, etc. If you spot errors in the math, please let me know and I'll correct this post!
As a reliable data source, we have the daily newsgroup feed sizes published by Newsdemon and u/greglyda.
Since Usenet backbones sync all incoming articles with each other via NNTP, this feed size will be roughly the same for Highwinds too.
Ok, good. With these values we can make a neat table and use them to approximate a mathematical function via regression.
For consistency, I assumed each of the provided MM/YY dates to be the first of that month. In my table, 2017-01-01 (all dates in this post are YYYY-MM-DD) marks x value 0, since it's the first date provided. The x-axis is the number of days passed since then, the y-axis the daily feed in TiB. For each data point I calculated the days passed since 2017-01-01 with a timespan calculator. For example, Newsdemon states the daily feed in August 2023 was 220 TiB, and 2403 days passed between 2017-01-01 and 2023-08-01, giving the value pair (2403, 220). The result for all values looks like this:
The values from Newsdemon in a coordinate system
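If you'd rather script the date-to-day conversion than click through a timespan calculator, here's a minimal Python sketch. It only includes the two feed figures quoted in this post; the rest of the Newsdemon table has to be filled in by hand:

```python
from datetime import date

# Day 0 of the graph: the first date in the Newsdemon table
ORIGIN = date(2017, 1, 1)

# (year, month, daily feed in TiB) -- only the figures quoted in this post,
# the remaining rows have to be filled in from the Newsdemon table
feed_samples = [
    (2023, 8, 220),   # August 2023: 220 TiB/day
    (2024, 11, 475),  # November 2024: 475 TiB/day
]

# Convert each MM/YY entry to (days since 2017-01-01, TiB/day)
value_pairs = [
    ((date(y, m, 1) - ORIGIN).days, tib)
    for y, m, tib in feed_samples
]

print(value_pairs)  # [(2403, 220), (2861, 475)]
```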
Then, via regression, I calculated the function that fits the values best. It's an exponential function. The result:
y = 26.126047417171 * e^(0.0009176041129 * x)
with a coefficient of determination (R²) of 0.92.
https://preview.redd.it/719sxsea3see1.png?width=211&format=png&auto=webp&s=3dfdd1b23a369e8a8f1b229fc86cc821de3ea264
Not perfect, but pretty decent. In the graph you can see why it's "only" 0.92, not 1:
https://preview.redd.it/73gbnybb3see1.png?width=2125&format=png&auto=webp&s=ef437be446f33ec0bcc5b7e0ac4c1e812906c01c
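I did the regression with a regression tool; if you want to reproduce it in Python, one common approach is an exponential regression done as a linear least-squares fit on ln(y). A minimal sketch (with only the two data points quoted in this post as placeholders, so the constants won't match until you add the full table):

```python
import numpy as np

# (days since 2017-01-01, TiB/day) -- only the two figures quoted in this post;
# the full Newsdemon table is needed to actually land near a ≈ 26.13, k ≈ 0.000918
value_pairs = [(2403, 220), (2861, 475)]

x = np.array([p[0] for p in value_pairs], dtype=float)
y = np.array([p[1] for p in value_pairs], dtype=float)

# Exponential regression via a linear fit on the log:
# y = a * e^(k*x)  <=>  ln(y) = ln(a) + k*x
k, ln_a = np.polyfit(x, np.log(y), 1)
a = np.exp(ln_a)

# Coefficient of determination of the exponential model against the data
residuals = y - a * np.exp(k * x)
r_squared = 1.0 - np.sum(residuals**2) / np.sum((y - y.mean()) ** 2)

print(f"y = {a:.12f} * e^({k:.13f} * x)   R^2 = {r_squared:.2f}")
```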
The most recent values skyrocket beyond the "healthy" exponential growth that can be seen from January 2017 until around March 2024. In the Reddit discussions regarding this phenomenon, there was speculation that some AI scraping companies abuse Usenet as a cheap backup, and the graphs seem to back that up. I hope the providers will implement some protection against this, because this cannot be sustained.
Aaanyway, back to topic:
The area under this curve over a given interval corresponds to the total data stored during that time. So if we calculate the integral of the function, we get a number that roughly estimates the total storage size based on the data we have. To integrate, we first need to determine the interval to integrate over.
So back to the timespan calculator. The current retention of Highwinds at the time of writing this post (2025-01-23) is 6002 days. According to the timespan calculator, that means Highwinds' retention starts on 2008-08-18. We set 2017-01-01 as day 0 in the graph earlier, so we need to calculate our upper and lower interval limits relative to that. Between 2008-08-18 and 2017-01-01, 3058 days passed. Between 2017-01-01 and today, 2025-01-23, 2944 days passed. So our lower interval bound is -3058 and our upper bound is 2944. Now we can integrate our function as follows:
∫ from -3058 to 2944 of 26.126047417171 * e^(0.0009176041129 * x) dx ≈ 422540
Therefore, the amount of data stored at Highwinds is roughly 422540 TiB. This equals ≈464.6 Petabytes. Mind you, this is just one copy of all the data, assuming they stored the entire feed. In practice they will keep identical copies in both their US and EU datacenters, plus more than one copy for redundancy reasons. This is just the accumulated amount of data over the last 6002 days.
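As a sanity check, the same integral can be evaluated in closed form with a few lines of Python, using the fitted constants from above (the antiderivative of a*e^(k*x) is (a/k)*e^(k*x)):

```python
import math
from datetime import date

a = 26.126047417171    # fitted coefficient (TiB/day at x = 0, i.e. 2017-01-01)
k = 0.0009176041129    # fitted daily growth rate

origin = date(2017, 1, 1)
lower = (date(2008, 8, 18) - origin).days   # -3058: start of Highwinds retention
upper = (date(2025, 1, 23) - origin).days   #  2944: date of this post

# Evaluate (a/k) * e^(k*x) between the bounds
total_tib = (a / k) * (math.exp(k * upper) - math.exp(k * lower))
total_pb = total_tib * 2**40 / 10**15       # TiB -> petabytes

print(f"{total_tib:,.0f} TiB ≈ {total_pb:,.1f} PB")   # ≈ 422,540 TiB ≈ 464.6 PB
```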
Now with this info we can estimate some figures:
The estimated daily feed in August 2008, when Highwinds started expanding their retention, was 1.6 TiB. The latest figure we have from Newsdemon is 475 TiB daily, from November 2024. Break it down and the entire daily newsfeed of August 2008 is now transferred roughly every 5 minutes: 475 TiB per day is about 0.33 TiB per minute, so 1.6 TiB takes ≈4.85 minutes.
With the growth rate of the calculated function, the stored data size will reach 1 million TiB by mid-August 2027. It will likely happen earlier if the feed keeps growing beyond its "normal" exponential rate, which it maintained from 2008 to 2023 before the (AI?) abuse started.
10000 days of retention would be reached in early January 2036 (2008-08-18 plus 10000 days lands on 2036-01-04). At the growth rate of our calculated function, the total data stored over these 10000 days would be 16627717 TiB. This equals ≈18282 Petabytes, about 39x the current amount. Gotta hope that HDD density growth comes back to exponential growth too, huh?
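Here's a rough sketch reproducing these projections under the same assumptions (the exact dates can shift by a few days depending on rounding):

```python
import math
from datetime import date, timedelta

a = 26.126047417171                      # fitted coefficient (TiB/day at x = 0)
k = 0.0009176041129                      # fitted daily growth rate
origin = date(2017, 1, 1)                # x = 0 in the graph
retention_start = date(2008, 8, 18)
lower = (retention_start - origin).days  # -3058

def stored_tib(day):
    # Cumulative feed from retention start up to `day` (days since 2017-01-01)
    return (a / k) * (math.exp(k * day) - math.exp(k * lower))

# Daily feed the model predicts for August 2008
print(f"Feed at retention start: {a * math.exp(k * lower):.1f} TiB/day")        # ≈ 1.6

# Minutes needed to move 1.6 TiB at the November 2024 rate of 475 TiB/day
print(f"1.6 TiB at 475 TiB/day: {1.6 / (475 / 1440):.2f} minutes")              # ≈ 4.85

# Day on which the cumulative total crosses 1,000,000 TiB
x_million = math.log(1_000_000 / (a / k) + math.exp(k * lower)) / k
print("1 million TiB around", origin + timedelta(days=x_million))               # mid-August 2027

# 10,000 days of retention: date reached and total stored
print("10,000 days of retention on", retention_start + timedelta(days=10_000))  # 2036-01-04
print(f"Total at 10,000 days: {stored_tib(lower + 10_000):,.0f} TiB")            # ≈ 16.6 million TiB
```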
Some personal thoughts at the end: One big bonus that Usenet offers is retention. If you go beyond just downloading the newest releases automated with *arr and all the fine tools we now have, Usenet always was and still is really reliable for finding old and/or exotic stuff. Up until around 2012, many posts were unobfuscated and are still indexable via e.g. nzbking. You can find really exotic releases of all content types, no matter if movies, music, TV shows or software. You name it. You can grab most of these releases and download them at full speed. Some random upload from 2009? Usually not an issue. Only when they've been DMCA'd it may not be possible. With torrents, you often end up with dried-up content: 0 seeders, no chance. It does make sense, who seeds the entirety of exotic stuff ever shared for 15 years? Can't blame the people. I personally love the experience of picking the best-quality uploads of obscure media that someone posted to Usenet like 15 years ago. And more often than not, it's the only copy still available online. It's something special. And I fear that with the current development, at some point the business model "Usenet" won't be sustainable anymore. Not just for Highwinds, but for every provider.
I feel like Usenet is the last living example of the saying that "The Internet doesn't forget". Because the Internet does forget, faster than ever. The Internet gets more centralized by the day. Usenet may be forced to consolidate further as the feed keeps growing. If the origin of the high feed figures is indeed AI scraping, we can just hope that the AI bubble bursts asap so they stop abusing Usenet, and that the providers can maybe filter out those articles without sacrificing retention, past or future, for all the other data people actually want to download. I hope we will continue to see growing Usenet retention, and hopefully 10000 days of retention and beyond.
Thank you for reading till the end.
tl;dr Calculated from the known daily Usenet feed sizes, Highwinds stores approximately 464.6 Petabytes of data with its current 6002 days of retention at the time of writing. This figure is just one copy of the data.