Instead of pinching pennies on Snowflake, make dollars
Say "No" to a world where everyone is cost cutting their Snowflake bill. Here's how to actually use Snowflake to drive value.
With interest rates no longer at zero, we're in lean times and nearly every company has taken an interest in cost cutting. One of the biggest line items for data-driven companies is their infrastructure, often powered by Snowflake. A ton has been written on how to reduce the bill, but that reduction comes at a cost. As the great Wayne Gretzky said, "Skate to where the puck is going," and if everyone else is cutting, you should instead lean in to the data you have and unlock its full potential. Here's a series of tips for maximizing the value of your data while maximizing your Snowflake bill:
Warehouses
Do not set auto-suspend rules. You never know when someone on your team will need access to your data. After all, you are a data-driven company and you want to democratize data access. There's nothing as frustrating as waiting for a suspended warehouse to resume, and if people know they'll have to wait they won't bother running the query in the first place.
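For the committed, this is a one-liner; a minimal sketch, assuming a hypothetical warehouse named analytics_wh:

```sql
-- Setting AUTO_SUSPEND to 0 tells Snowflake to never suspend the warehouse
-- (warehouse name is hypothetical)
ALTER WAREHOUSE analytics_wh SET AUTO_SUSPEND = 0;
```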
Use the largest warehouse possible. Sure, there might be diminishing returns that you could easily test and benchmark, but what if there aren't? You want the results fast and it's worth paying for the ultimate performance. Now you don't even have to second-guess yourself, since you know you're getting the best performance money can buy. And isn't that peace of mind worth it?
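Committing to the top end is just as easy, again on the hypothetical analytics_wh:

```sql
-- X4LARGE sits near the top of Snowflake's standard size ladder;
-- each size up roughly doubles the credits burned per hour
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'X4LARGE';
```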
Autoscale to infinity. One of the cloud's big benefits is infinite scale and the ability to pay for what you use, so take advantage of that. You never want to slow your team down, and big results require big analysis.
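A sketch of maxed-out multi-cluster settings; Snowflake currently caps MAX_CLUSTER_COUNT at 10, so this is as close to infinity as the syntax allows:

```sql
-- Multi-cluster warehouses are an Enterprise edition feature; this config
-- eagerly spins up extra clusters whenever queries start to queue
ALTER WAREHOUSE analytics_wh SET
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 10        -- the current ceiling
  SCALING_POLICY = 'STANDARD';  -- scale out immediately, unlike 'ECONOMY'
```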
Queries
Don't cache. You can't trust Snowflake's caching, so it's always best to go to the source of truth. Especially now that you have the largest, always-on, infinitely scalable warehouses, you know you can handle it.
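To make sure no result is ever reused, there's a session parameter for that:

```sql
-- Disable result-cache reuse so every query re-scans the data on the warehouse
ALTER SESSION SET USE_CACHED_RESULT = FALSE;
```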
Don't bother optimizing. If you have a query that works, just keep adding to it. You may not need all the fields or tables, but they're already there and you want to avoid rewriting a query. If you can create a single query that answers multiple questions, even better. That might be harder to maintain, but imagine being able to share that query with everyone on the team. Monoliths are making a comeback, so let's bring back the monoquery.
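A minimal sketch of the monoquery, with entirely hypothetical table names; every team filters it down in their BI tool and nobody ever prunes the SQL:

```sql
-- One query to answer every question: join everything, select everything
SELECT *
FROM orders o
JOIN customers c       ON c.customer_id = o.customer_id
JOIN products p        ON p.product_id  = o.product_id
JOIN web_events e      ON e.customer_id = c.customer_id
JOIN support_tickets s ON s.customer_id = c.customer_id;
```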
Tables
Create multiple versions of every table. You never know what you'll need, so if you have the opportunity to create multiple versions of the same data, go for it. This also keeps performance as high as possible, since people will always know to go to the most optimal table for their use case. And as new data comes in, make sure to update every table - even if the data may not have changed - since it's better to be safe and you want consistently actionable data.
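In practice that looks something like this (table names hypothetical); note that every load now has to rebuild all of them, changed or not:

```sql
-- The same data, materialized once per conceivable access pattern
CREATE OR REPLACE TABLE orders_by_date     AS SELECT * FROM orders ORDER BY order_date;
CREATE OR REPLACE TABLE orders_by_customer AS SELECT * FROM orders ORDER BY customer_id;
CREATE OR REPLACE TABLE orders_by_region   AS SELECT * FROM orders ORDER BY region;
```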
Recreate tables from scratch. The world keeps moving and so should your data. Instead of appending new data to your existing tables, just recreate them from scratch. That way you can ensure they're perfect every time. And if you ever need to change the schema around, you won't have to worry about a complicated backfill - every table creation is a backfill.
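A sketch of the rebuild-everything pattern, assuming a hypothetical daily_revenue rollup:

```sql
-- Full rebuild on every run; the thrifty alternative would be an incremental
-- INSERT of only the new days, but where's the purity in that?
CREATE OR REPLACE TABLE daily_revenue AS
SELECT order_date, SUM(amount) AS revenue
FROM orders
GROUP BY order_date;
```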
Max out time travel. Who knows if you'll need to see a table as it looked months ago? You don't want to risk not having it, so crank the Time Travel retention window to its maximum to be safe.
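Retention is set per table (or account-wide), and the ceiling on Enterprise edition is 90 days:

```sql
-- Crank Time Travel retention to the maximum for one table...
ALTER TABLE orders SET DATA_RETENTION_TIME_IN_DAYS = 90;
-- ...or for everything at once
ALTER ACCOUNT SET DATA_RETENTION_TIME_IN_DAYS = 90;
```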
Keep data forever. With the pace of AI these days, there might be a whole new tool tomorrow that unlocks value from your old data, so do not get rid of anything. You do not want to be responsible for missing out on business value because of some arbitrary data deletion rule. And since you're paying separately for storage, and storage is cheap, it's a rounding error.
Ignore clustering. Defining clustering keys assumes you know your access patterns, but we live in a dynamic world and you do not want to limit yourself or your business. It's safer to keep things fluid and not commit to a single approach.
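For the record, this is the statement you are bravely declining to run (column choice hypothetical):

```sql
-- A clustering key would co-locate rows by date and let Snowflake prune
-- micro-partitions on date-filtered queries; too limiting, clearly
ALTER TABLE orders CLUSTER BY (order_date);
```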
Columns
Use the largest possible field types. You may think you know what data you'll have coming in, but the world is full of surprises, so better to build in that optionality early. It's columnar storage anyway, so it'll all work out.
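A sketch of a maximally roomy table, with hypothetical names throughout:

```sql
CREATE TABLE events (
    id          NUMBER(38, 0),      -- the widest NUMBER precision Snowflake offers
    payload     VARCHAR(16777216),  -- 16 MB, the VARCHAR maximum
    attributes  VARIANT,            -- semi-structured catch-all, just in case
    happened_at TIMESTAMP_TZ(9)     -- nanosecond precision, why not
);
```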
Keep adding new columns and never audit. Just because a column hasn't been used in the past year doesn't mean it won't be in the future, so keep everything around. The business is paying you to generate insights and data, not to remove them.
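Accretion in action (column name hypothetical):

```sql
-- new_metric_v3 joins v1 and v2, which of course stay right where they are
ALTER TABLE events ADD COLUMN new_metric_v3 FLOAT;
```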
Observability
Always run the full gamut of checks. Data is the lifeblood of any organization and you want to ensure it's accurate and consistent. Every time your data changes or new data arrives, you need to validate it to make sure it's good enough to work with. Doing simple counts and sums is old news; instead, lean in to more sophisticated AI approaches that can uncover hidden anomalies. Sure, it might be more expensive and might not scale, but can you really put a price on good data?
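Even the "old news" baseline adds up when it's a full-table scan on every load; a sketch with hypothetical names (the AI-powered version costs extra):

```sql
-- Re-validate the entire table every time anything changes, not just the new rows
SELECT COUNT(*)                      AS row_count,
       COUNT(DISTINCT order_id)      AS distinct_orders,
       SUM(amount)                   AS total_amount,
       COUNT_IF(amount < 0)          AS negative_amounts,
       COUNT_IF(customer_id IS NULL) AS orphaned_rows
FROM orders;
```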
Pay for every vendor that promises data quality improvements. There are a ton of vendors here, and they might all overlap on the bulk of what they do, but they vary at the edges. And you generate alpha at the edges, so you should work with every observability vendor out there to make sure your data is perfect.
With these tips and tricks you’re well on your way to unlocking the tremendous potential of your data.
Disclosure: This was mostly sarcasm and a reaction to all the recent cost-savings posts. Of course you should follow best practices and maintain good data hygiene. At the same time, if all you're doing is worrying about cost savings, you're likely focusing on the wrong thing.