Understanding the Total Cost of Ownership of Serverless and Container Applications | by Alan Helton | November 2022

It costs a lot more than your monthly AWS bill to build and maintain an app in production

Image by jcomp on freepik

Last week I published an article discussing whether serverless is more expensive than containers. In that post, I did a simple comparison between Lambda, EC2, and App Runner to show the relative cost of compute.

I also mentioned something in that article several times that I realized needed a bit of clarification. I qualified my findings by saying that even though we found a tipping point where serverless compute becomes more costly than provisioned resources, the total cost of ownership (TCO) is still lower.

Total cost of ownership is probably not a new concept for many of you reading this post, but in case it is, I want to talk about what it means and the specifics of how it applies when we compare serverless and container applications.

When we talk about what it costs a company to successfully run an application in production, we should consider more than our monthly AWS bill. Everything that goes into operating that app factors into the cost.

Contributors to the total cost of ownership:

  • Base infrastructure – compute, provisioned resources, storage
  • Initial development – time from idea to production
  • Maintenance – patching, OS updates, bug fixes, etc.

This is not an exhaustive list, but it represents the three major sources of cost for production applications.

Infrastructure costs, also known as running costs, are what my post focused on last week. Comparing serverless apps to traditional cloud apps this way is a bit like comparing apples and oranges, but the end result is the same: your monthly bill.

Serverless infrastructure costs are based on your usage, whereas traditional workloads are billed on provisioned resources. These are highly variable costs driven by monthly request volume, average request/response size, and peak versus sustained load.

Infrastructure costs are billed monthly and will recur for the life of your application. The costs may go up or down as you improve your app, but you’ll always have some sort of infrastructure bill.
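To make the usage-based versus provisioned distinction concrete, here is a minimal back-of-the-envelope sketch in Python. The rates below are illustrative placeholders, not current AWS prices, so check the pricing pages before relying on numbers like these.

```python
# Back-of-the-envelope monthly cost comparison (illustrative prices only).

# Assumed example rates -- check the AWS pricing pages for real numbers.
LAMBDA_PER_MILLION_REQUESTS = 0.20    # USD per 1M requests
LAMBDA_PER_GB_SECOND = 0.0000166667   # USD per GB-second of compute
EC2_INSTANCE_PER_HOUR = 0.0416        # USD, e.g. a small burstable instance

def lambda_monthly_cost(requests: int, avg_duration_ms: float, memory_gb: float) -> float:
    """Usage-based: cost scales directly with traffic."""
    request_cost = requests / 1_000_000 * LAMBDA_PER_MILLION_REQUESTS
    gb_seconds = requests * (avg_duration_ms / 1000) * memory_gb
    return request_cost + gb_seconds * LAMBDA_PER_GB_SECOND

def ec2_monthly_cost(instance_count: int, hours: float = 730) -> float:
    """Provisioned: cost is flat whether or not traffic shows up."""
    return instance_count * hours * EC2_INSTANCE_PER_HOUR

print(f"Lambda: ${lambda_monthly_cost(5_000_000, 120, 0.5):,.2f}")  # ~ $6
print(f"EC2:    ${ec2_monthly_cost(2):,.2f}")                       # ~ $61
```

The takeaway isn't the specific dollar amounts; it's that one line scales with request volume while the other accrues around the clock regardless of traffic.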

Development costs, also known as cost-to-build, are up-front charges that are harder to measure. To determine them, you need to take into account how many developers will be on the project, how much each of them is paid, and how long the build will take.

I’m sure not everyone will agree, but in general, serverless development is faster than traditional development.

Why?

Well, for one, infrastructure scaling is taken out of the equation. Outside of planning for orders of magnitude of scale, serverless handles it for you. Contrast that with designing a system that can scale up and down elastically yourself: over-provision and you waste money, under-provision and you lose opportunities and make your customers unhappy. Figuring out how to handle production traffic takes careful time and planning on top of building the software itself.

If you imagine that a full-time employee costs the company $100K/year, factoring in a few extra weeks of development per developer quickly starts to add up. If serverless eliminates 6 weeks of development time from the initial build with a team of 6 engineers, you're looking at roughly a $70K difference in the cost to build.
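Spelling out the arithmetic behind that estimate:

```python
# Rough cost-to-build savings from the figures above.
salary_per_week = 100_000 / 52     # ~ $1,923 per engineer-week
saved = 6 * 6 * salary_per_week    # 6 engineers x 6 weeks saved
print(f"${saved:,.0f}")            # -> $69,231, i.e. roughly $70K
```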

After initial development is complete and your application is live in production (congratulations!), you enter the maintenance cycle or cost-to-support stage.

This is an ongoing personnel cost for the time it takes to keep your application running smoothly. Monitoring your application appropriately, watching dashboards and dead letter queues, and adding new features all fall under this phase.

Some organizations divide responsibilities across different teams at this stage. Development teams build new features and tooling, while site reliability engineering (SRE) teams handle operational tasks such as monitoring dashboards and responding to incidents.

However, the cost of personnel is the cost of personnel. The company is still paying for engineers' time spent on ongoing maintenance and monitoring, so this is a cost that must be included in the TCO.

Serverless apps still have maintenance tasks like dead letter queue monitoring and alerting when usage gets close to service quotas. But because of the shared responsibility model with serverless, the cloud vendor takes over the server software, networking infrastructure, hardware maintenance, and more, so you don't have to.
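As an example of the monitoring that stays on your plate, here's a small sketch using boto3 to alarm when anything lands in a dead letter queue. The queue name and SNS topic ARN are made up for illustration; this is one way to do it, not a prescribed pattern.

```python
# Hypothetical sketch: alarm when a dead letter queue starts filling up.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="orders-dlq-not-empty",                     # made-up name
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "orders-dlq"}],  # made-up queue
    Statistic="Maximum",
    Period=300,                   # evaluate in 5-minute windows
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",            # any message at all
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder ARN
)
```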

Contrast this with traditional workloads, where server software maintenance such as OS updates and patching, network configuration, and virtual machine management falls back on you. You pay not only for the resources you provision but also for the effort to maintain them.

Maintenance costs also include lost productivity. When engineers are busy configuring a fleet of servers or troubleshooting a problem on a virtual machine, they aren't building new features or innovating. Over the long run, that slows the company down as it tries to grow.

I worked for a few years on an application that was completely serverless. This application needed an integration to communicate with a piece of legacy software hosted on a load-balanced fleet of EC2 instances.

To facilitate communication, we built a middle tier that would respond to events via webhooks, perform some data transformations, then call the legacy system's APIs.

This middle tier ran on its own separate fleet of EC2 instances, resulting in an architecture resembling the one below.

Intentionally simplified architecture diagram

As our go-live date drew closer, we ran our first set of load tests and quickly discovered that the serverless application scaled well beyond what our middle tier could handle. The volume of events hitting the integration webhook overwhelmed the system and prevented the downstream integration from keeping up.

So we regrouped, spent a lot of time planning and forecasting traffic loads and peaks, and resized the integration tier. We also put API Gateway and an SQS queue in front of it to act as a buffering mechanism. The EC2 instances would then poll the queue and work items off at their own pace.

Modified architecture after load testing
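For reference, the polling side of that buffering pattern looks roughly like the sketch below. The queue URL is a placeholder, and call_legacy_api is a hypothetical stand-in for the real data transformation and legacy API call.

```python
# Hypothetical worker loop: poll the buffer queue and process at our own pace.
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/integration-buffer"  # placeholder

def call_legacy_api(body: str) -> None:
    """Placeholder for the data transformation + legacy system call."""
    ...

while True:
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,   # long polling keeps the request count down
    )
    for message in resp.get("Messages", []):
        call_legacy_api(message["Body"])
        # Only delete after processing succeeds; otherwise the message
        # becomes visible again and can eventually land in a dead letter queue.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
```

The key design choice is that the queue absorbs the serverless app's bursts, so the EC2 workers only ever pull as much as they can handle.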

After another round of load testing, it looked like we had sized the environment correctly and were ready to go live. Hooray!

On the day we went live, the system had about 50,000 users. The serverless application scaled gracefully and handled the traffic growth without any issues. The integration tier was also able to handle the traffic, but began to build up a modest backlog of work. As work piled up, items were sitting in the SQS queue for longer and longer periods.

So our SRE team responded and scaled the EC2 fleet both horizontally and vertically. In turn, the backlog shrank and things continued as we expected.

Over the next few months, an eye-opening pattern began to emerge.

The serverless application we built performed like a well-oiled machine. We found some flaws in the software, but there were no problems with the infrastructure. It scaled up during peak hours and back down during off-peak hours. Overall, the dev team responsible for that application was able to keep growing it without interruption.

However, the integration tier and legacy application were a different story. We had a team of 7 engineers working around the clock to keep these applications running. For the first several weeks, they continuously watched CPU usage, memory allocation, and error queues in an attempt to stabilize the environment. The bursty nature of our app would cause spikes that blew past our initial configuration of the EC2 fleet.

For the first few months, nothing was done on these two apps except damage control. There was so much work applying patches, rebooting servers, reconfiguring load balancers, and monitoring server statistics that it was impossible to do anything else.

This was the exact opposite of a serverless app.

In my example, the serverless app cost less money to run, build, and maintain, but that's not always the case. There will be times when your infrastructure costs are higher with serverless, but you need to factor in how much personnel time it takes to run the "cheaper" solution.

Decreased productivity, a slower rate of innovation, and burnout are all intangible aspects of TCO that have a major impact on an organization. Serverless development enables engineers to move faster, worry less (to an extent) about sizing their application correctly, and save the company some serious money.

Serverless isn’t a silver bullet, and it takes organizational maturity to handle cloud-native applications in production. Upskilling not only your developers but also your support team is critical to your success with this technology.

When questioned about serverless costs "at scale," be sure to talk not only about infrastructure costs, but also about the other costs to the company, such as the cost to build and the cost to support. All of these factors drive the success (and profit margin!) of your application.

Happy coding!
