From S3 to R2: An economic opportunity
Cloudflare's R2 is an undiscovered gem and is redefining the economics of data storage
S3 is an amazing service. One can argue that the launch of S3 was the origin of the modern data stack. The most common initial use case was storing static assets such as images, stylesheets, and JavaScript code, but we quickly discovered that we could dump whatever we wanted into it at a low cost. The approach shifted from being careful about what was stored to just storing everything, since it might end up being useful. Before S3, big data only existed in the megacorps, but the ability to store nearly infinite amounts of data cheaply launched a whole new ecosystem.
One of the most common data-specific uses of S3 is staging data. If you built your own data collection stack, you typically have Kafka or Kinesis collecting events and offloading them to S3 for permanent storage. Once the data is in S3 you have a variety of options. You can use Spark to read, manipulate, and transform the data before dumping it back to S3 or into a warehouse. Or you can load the data into your data warehouse, such as Snowflake, and do the data processing there. Once you’re happy with that neatly massaged and transformed data, you can put it in a variety of places to support a variety of use cases. You can keep it in the data warehouse to power a reporting API, dump it as an Iceberg table to S3 and access it from a Jupyter notebook, write it out as Parquet files that can be read via DuckDB, or pick from countless other options.
The biggest problem with S3 is data transfer costs. It’s an open secret that AWS uses data transfer costs as a lock-in mechanism. According to the AWS S3 pricing page, you pay anywhere from $0.05/GB to $0.09/GB for outbound data transfer in us-east-1. At big data scale this adds up. AWS’s own costs are obviously far lower and it could pass on the savings, but the point of the pricing is less to make money directly than to encourage lock-in, which of course leads to more money. A few years ago Cloudflare published an analysis estimating that AWS marks up data transfer by as much as 8000%.
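To make "this adds up" concrete, here is a rough back-of-the-envelope sketch that applies the two per-GB rates quoted above to a few data volumes. It assumes a flat rate for the whole transfer and decimal terabytes (1 TB = 1000 GB); real AWS pricing is tiered, so treat the output as an illustrative range rather than a quote. The function name and the sample volumes are made up for this example.

```python
# Per-GB transfer rates quoted in the text for us-east-1 (USD).
LOW_RATE_PER_GB = 0.05
HIGH_RATE_PER_GB = 0.09

# Assumption: decimal terabytes, to keep the arithmetic simple.
GB_PER_TB = 1000


def egress_cost_range(terabytes: float) -> tuple[float, float]:
    """Return the (low, high) cost in USD of transferring `terabytes` out once.

    Simplification: a single flat rate is applied to the whole volume,
    whereas actual pricing is tiered by monthly usage.
    """
    gigabytes = terabytes * GB_PER_TB
    return gigabytes * LOW_RATE_PER_GB, gigabytes * HIGH_RATE_PER_GB


if __name__ == "__main__":
    for tb in (1, 100, 1000):  # hypothetical volumes
        low, high = egress_cost_range(tb)
        print(f"{tb:>5} TB: ${low:,.0f} to ${high:,.0f}")
```

Under these assumptions, reading 100 TB out of S3 once costs somewhere between $5,000 and $9,000, and that charge recurs every time the same data leaves AWS.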
Cloudflare has an incentive to call out S3’s egregious pricing, since it has a competing service called R2, but R2 really is better. It charges less per gigabyte of storage, less for the various operations, and nothing at all for data egress. It’s amazing what they’ve achieved. These days everyone is trying to find ways to leverage AI on top of their data. And given how nascent the AI space is, there’s still significant differentiation in the AI services offered across the cloud providers. Microsoft has OpenAI, AWS has Anthropic, and Google has Google. While we wait for the offerings to become more commoditized, it’s valuable to have the option to use our data with whichever provider gives us the most benefit. Cloud neutrality doesn’t matter as much when services aren’t differentiated but matters a great deal when there is true differentiation. And because R2 doesn’t charge for egress, it gives you that neutrality.
I’m surprised R2 hasn’t been widely adopted and consider it an undiscovered gem. If you have data-heavy workloads and a high storage bill, you should seriously consider R2. In fact, there’s an opportunity to build entire companies that take advantage of this price differential, and I expect we’ll see more and more of that happening.
Whenever I try to migrate to R2 using Sippy, I get this error on the PUT request that sets up Sippy:
```
{"success":false,"errors":[{"code":10063,"message":"Invalid upstream credentials"}],"messages":[],"result":null}
```
I'm using AWS credentials with read and list permissions for my bucket and R2 credentials with read and write permissions. Were you able to get past this?