Arka Chaudhuri, Cranky Cloud Guy

“Organizations that have done little or no cloud optimization are overspending by 70% or more” — Gartner

It’s been a while since organizations began embracing a cloud-first strategy. Some took a prudent, deliberate approach and assessed existing usage and workloads; others went for broke and lifted and shifted everything they could. Business models that were cost-prohibitive in traditional datacenter-based computing became attainable, and the pay-for-use model freed organizations big and small to scale up and down rapidly without breaking the bank. This elasticity let them bring products and services to market exponentially faster and pay only for what they used. Or so it seemed, until they saw the bill.

Gotta feed the cattle, people!

Cloud Cost and Infrastructure

The bottom line is that the cloud is someone else’s infrastructure that you rent space in. Just because you’re not tending to owned infrastructure in your very own datacenter doesn’t mean that the costs for compute and transit go away. The cloud service providers (CSPs) have hyperscaled resources that apportion the costs across customers, but you still have to pay your share.

The “cattle vs pets” analogy does hold true, but you still need to feed the cattle. Do you know where your cattle are, and how much feed they are munching per day? Do you know how much your stopped instances are costing, and how much you’re paying for EBS volumes that haven’t been used in a while?
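If you can’t answer those questions, an inventory sweep is the place to start. Here’s a minimal sketch of the idea: flag volumes that are detached and have sat idle past a threshold. The data shapes and field names here are illustrative assumptions; in practice you’d populate the inventory from your CSP’s API (on AWS, something like `DescribeVolumes`) rather than hard-coding it.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical inventory snapshot -- in practice, pull this from your CSP's
# API; the field names and values here are illustrative only.
volumes = [
    {"id": "vol-01", "state": "available",
     "last_attached": datetime.now(timezone.utc) - timedelta(days=90)},
    {"id": "vol-02", "state": "in-use",
     "last_attached": datetime.now(timezone.utc)},
    {"id": "vol-03", "state": "available",
     "last_attached": datetime.now(timezone.utc) - timedelta(days=3)},
]

def idle_volumes(vols, max_idle_days=30):
    """Flag unattached volumes that have been idle past the threshold."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_idle_days)
    return [v["id"] for v in vols
            if v["state"] == "available" and v["last_attached"] < cutoff]

print(idle_volumes(volumes))  # vol-01 has been detached for 90 days
```

Run something like this on a schedule, feed the output into a review (or an automated cleanup with a grace period), and you’ll know exactly how much feed the herd is munching.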

Who’s raiding the fridge?

The shift of compute and storage costs from capex to opex was empowering to say the least. Businesses could scale and pay for consumed resources as they grew — that is, if said scaling brought in corresponding revenue. A DDoS attack doesn’t bring revenue, nor do unoptimized applications and legacy monoliths running up infra and transit bills while sitting relatively idle.

The lure of infinite resources is another cause for concern. CSPs have cleverly abstracted the common units of compute resources (vCPUs, RAM, block and object storage) behind their own scaling units. (Quick, tell me without looking it up: how many vCPUs and how much RAM in a c3.xlarge instance? Thought so.)

Building resource-expensive dev stacks becomes very easy when you don’t immediately have to care about how many resources and/or managed services you’re using. We can’t expect applications that come out of these stacks to be the most efficient and optimized on the planet, can we? Without guardrails applied and enforced from the beginning, cost efficiency becomes an afterthought that will come back to haunt you.

So when confronted by finance with cloud spend numbers, are you able to trace resource usage to specific projects, users, and roles? How are you allocating the spend to business priorities? Are you using the right metrics to measure your usage? What are your historical usage patterns, and what can you learn from them? Are the resources leading directly to increased ROI and reduced TCO?

Stop blaming the CSPs

In a way, CSPs are like McDonald’s. It’s not their fault you choose Big Macs and thousand-calorie shakes over the salads on their menu. The CSPs provide a very wide range of resources, tools, and managed services for you to use. You have to understand what these do, how much they cost, and note the tradeoffs they force when you apply them to your requirements. That’s table stakes. The cloud gives you the ability to iteratively refine every layer of your stack, but for that to be possible, you should avoid the antipatterns that will constrain you down the road. To their credit, CSPs publish architecture guidance such as AWS’s Well-Architected Framework, but remember that they make money by selling you more services than you might need. Study the framework and apply its best practices, but don’t feel beholden to it. Draw your own conclusions and map your own tradeoffs to the framework as you build out the architecture.

If you are planning on migrating an on-prem legacy application, take note of implicit dependencies and price them in as if they are running in the cloud. Add in factors such as licensing, operations and maintenance costs, and see if it makes financial sense to use a managed service instead. If you’re starting afresh, start with a minimum viable outlay to match your minimum viable product — and keep a service migration strategy in your back pocket. Ultimately, no one knows the resource requirements of your apps better than you, right?

“But what about vendor lock-in?” you ask. Tradeoff time again. The provider where your applications are currently parked is a vendor with its own costs. Since you have already made the notional commitment to move to a different vendor, you can find out exactly what you’d get from your friendly neighborhood CSP, and for how much, just by running scenarios on the pricing calculator they provide online for free. Use that to your advantage.

Tag, you’re it!

Establish a tagging schema for cloud resources and follow it diligently. Use automation to prevent launching of untagged or partially-tagged resources. Shut down unused resources to save money — automate that based on the tagging schema. I’m not the first person to tell you these things. Do them. 
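The enforcement half of that advice is straightforward to sketch. Below is a toy validation check of the kind you’d wire into a provisioning pipeline or admission hook; the required tag keys are assumptions, so substitute your own schema.

```python
# Illustrative tag schema -- the key names here are assumptions; use yours.
REQUIRED_TAGS = {"project", "owner", "environment", "cost-center"}

def missing_tags(resource_tags):
    """Return the required tag keys a resource lacks; empty set = compliant."""
    return REQUIRED_TAGS - set(resource_tags)

# Gate resource creation on the check: block the launch if anything is missing.
tags = {"project": "checkout", "owner": "arka"}
missing = missing_tags(tags)
if missing:
    print(f"launch blocked: missing tags {sorted(missing)}")
```

On AWS the same intent can be expressed natively with IAM condition keys or Service Control Policies that deny resource creation without specific tags; the point is that the schema is enforced by machinery, not by memos.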

Can you see it?

You can’t solve what you can’t see. Without going into FinOps verbiage, visibility into cloud costs matters at two levels: where the spend originates (specific resources and workloads) and where it should be allocated (teams, projects, and business priorities).

Whether your infrastructure is all VMs or you’re running containerized applications, these considerations apply universally.

Notice that I haven’t said a word about chargebacks and similar apportionment methods. That would bring in organizational dynamics, and while it’s a vitally important consideration in cloud cost optimization, every organization has a different take on it. I’ll leave that for another article.

Some Common Optimization Strategies

The success of these strategies is predicated on your exact situation, so consider these as possibilities that may be helpful in your case. At Oteemo, we specialize in assessing your infrastructure needs and giving you actionable recommendations. Please give us a call if you would like specific ideas around your requirements.

Defining Environments through Automation

Infrastructure- and Compliance-as-code are your friends when it comes to cloud cost control. Defining environment stacks as code using your IaC tools of choice gives you the ability to create standardized, versionable building blocks from CSP services. These blocks can then be used in environment manifests that define the stacks you need for specific applications. The manifest lives with your application code, and can be versioned to provision different environments (dev, test, QA and production). As part of the manifest, resource tagging is also automated. This allows meaningful billing reports that can be traced to specific apps or business units. It also enforces consistency and resource optimization by standardizing environment provisioning in the cloud.
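To make the manifest idea concrete, here is a deliberately tiny sketch of the pattern: standardized building-block defaults, per-environment overrides, and tags stamped in automatically at render time. All the names and sizes are illustrative; a real implementation would live in your IaC tool of choice rather than raw Python.

```python
# Toy environment manifest: shared defaults plus per-environment overrides.
# App name, instance types, and counts below are illustrative assumptions.
BASE = {"app": "checkout", "instance_type": "t3.medium", "count": 2}
OVERRIDES = {
    "dev":  {"instance_type": "t3.small", "count": 1},  # cheap by default
    "prod": {"count": 4},                               # scaled for load
}

def render(env):
    """Expand the manifest for one environment, with tags applied automatically."""
    spec = {**BASE, **OVERRIDES.get(env, {})}
    spec["tags"] = {"project": spec["app"], "environment": env}
    return spec

print(render("dev"))
```

Because the tags are generated rather than typed by hand, every resource provisioned through the manifest is traceable in billing reports by project and environment, and a dev stack can’t quietly launch at production size.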

Compliance-as-code keeps environments in runtime compliance with established standards. Deviations can be logged and audited using logging tools that are either provided by CSPs or centralized. Every layer of the stack — infrastructure, platform and application — can be logged and audited to keep resource utilization within defined limits.

Sustained Usage Discounts

Use reserved instances for long-running workloads to reduce per-instance cost. Find other sustained usage discounts within the CSP’s services wherever possible.

Rightsizing

This is one area where tracking historical usage patterns pays dividends. Rightsizing can be done in many different ways to better fit resource size to workloads: moving to a different instance class, vertical scaling, and resource splitting, among other methods.

Spot Instance Utilization

CSPs offer spare capacity as short-term “spot” instances at a steep discount to on-demand rates. The downside is that the CSP can reclaim these resources at short notice if demand surges. To use spot instances effectively, your workloads have to fit a certain pattern: relatively short-running units of work that can tolerate interruption and the loss of in-flight state. There are also companies, like Spot, that use algorithmic fleet management with spot instances to provide relatively stable compute for your applications, with some usage caveats.
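The workload shape that suits spot capacity can be sketched simply: process work in small units and checkpoint after each, so a reclaimed instance only loses the unit in flight and a replacement resumes where it left off. The checkpoint format here is an illustrative assumption; real jobs would checkpoint to durable storage such as an object store, not local disk.

```python
import json, os, tempfile

def run(work_items, checkpoint_path):
    """Process items one at a time, checkpointing after each unit so a spot
    interruption only loses the item in flight."""
    done = set()
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = set(json.load(f))      # resume from the last checkpoint
    for item in work_items:
        if item in done:
            continue                      # finished before an interruption
        # ... do the actual work for `item` here ...
        done.add(item)
        with open(checkpoint_path, "w") as f:
            json.dump(sorted(done), f)    # durable checkpoint per unit
    return done

# Demo: the second run (a "replacement instance") only processes the new item.
ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
run(["a", "b"], ckpt)
print(run(["a", "b", "c"], ckpt))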

Bringing it home

My friend Doug raises about a hundred head of Piedmontese beef cattle on his picturesque farm in Fauquier County, Virginia. His beef is raised hormone-free and antibiotic-free, but he’s straightforward when asked whether his cattle are grass-fed. “They are,” he says, “for the most part. We have to finish with a little corn.” As I stood selling coffee next to him at the local farmers’ market every Saturday for eight long years, I saw grown men in tears recalling just how good his beef was. As you might expect, he never names his cattle. He just raises them as humanely as possible, with as few tradeoffs as possible. But there are tradeoffs, and nothing works the same way for everyone. You have to adapt your approach to get the maximum benefit from your circumstances, and cloud cost control is no exception. Arming yourself with the correct metrics and usage data, and aiming for the appropriate business outcomes, will get you farther than any amount of best practices ever can.

This article skims the surface of a vast ocean of cloud cost optimization strategies. Entire careers have been made and disciplines built around this. At Oteemo, we talk every day to organizations looking to contain their cloud bills, and we begin by understanding their business imperatives. Without that knowledge, no amount of cost control would be truly effective. Let’s talk. We can help you make sense of your cloud costs and give you actionable strategies to optimize them over time.