Following a post from my friend Julien Delange (Tech Ramblings) on software over-engineering, I want to share my thoughts about over-architecting in my preferred field of Public Cloud Architecture.

Background

I have been doing Cloud Architecture for more than ten years and have seen many different scenarios and use cases, from startups to GAFAM, across multiple industry verticals, from TV audience measurement to gambling and energy producers. I also frequently challenge my fellow architects’ decisions.

Over-Architecting

A common practice when approaching public cloud architecture for the first time is to start big, with trendy and complex patterns such as event-driven design or container orchestration (who said Kubernetes?), or to redo what you have been doing in the data center for decades.

A common mistake is to start with a highly resilient system capable of handling millions of users, even for an internal-only app with no internet exposure.

Okay, it’s cool, it makes a nice-looking diagram, and it’s easy to deploy on public cloud, but folks, you are not Netflix or Airbnb.

Start simple: a single instance, an API gateway with a single serverless function, a container with persistence in a managed NoSQL table, or a CSP-managed SQL database.
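
To give an idea of how small the "API gateway plus one serverless function" option can be, here is a minimal sketch of a Python Lambda handler, assuming the API Gateway proxy integration. The `name` query parameter and the greeting payload are made up for the illustration.

```python
import json


def handler(event, context):
    """Entry point invoked by API Gateway (proxy integration assumed)."""
    # queryStringParameters is None when the request has no query string.
    params = event.get("queryStringParameters") or {}
    name = params.get("name", "world")  # hypothetical parameter

    # The proxy integration expects a status code, headers, and a string body.
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }
```

That is the whole backend for a first iteration: no cluster, no queue, no orchestration. You can add pieces later, when real usage asks for them.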

Builders of Cathedrals

Working backward from business needs

Always start by asking about the business needs, especially the Recovery Time Objective (RTO) and Recovery Point Objective (RPO), but constantly challenge them against the cost of the cloud infrastructure required to meet them. Often, stakeholders will reconsider the initial RTO / RPO once the cost of such an architecture comes into the discussion.
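
A rough way to bring cost into that conversation is to map the requested RTO to a disaster recovery tier and show what it does to the bill. The sketch below uses the usual tier names (backup and restore, pilot light, warm standby, multi-site active/active), but every recovery time and cost factor in it is a placeholder I picked for the illustration, not a benchmark. It also only looks at RTO; RPO would need the same treatment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrTier:
    name: str
    recovery_hours: float  # assumed time needed to recover with this tier
    cost_factor: float     # hypothetical multiplier on the baseline bill


# Ordered from cheapest to most expensive. All numbers are placeholders.
TIERS = [
    DrTier("backup and restore", 24.0, 1.1),
    DrTier("pilot light", 4.0, 1.3),
    DrTier("warm standby", 1.0, 1.6),
    DrTier("multi-site active/active", 0.0, 2.0),
]


def cheapest_tier(rto_hours: float) -> DrTier:
    """Return the cheapest tier whose assumed recovery time fits the RTO."""
    for tier in TIERS:
        if tier.recovery_hours <= rto_hours:
            return tier
    return TIERS[-1]  # nothing fits: fall back to the most resilient tier


def monthly_estimate(baseline_cost: float, rto_hours: float) -> float:
    """Baseline monthly cost scaled by the selected tier's cost factor."""
    return baseline_cost * cheapest_tier(rto_hours).cost_factor
```

With these placeholder numbers, asking for a 15-minute RTO (`monthly_estimate(1000.0, 0.25)`) doubles the baseline, while accepting an 8-hour RTO lands on pilot light. Putting the two figures side by side is usually what triggers the "do we really need this?" discussion.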

Build in Public

While building my SaaS product (unusd.cloud), I ran into a scaling problem after 2 years and more than 450 users, with hundreds of AWS accounts on the platform. The platform mostly runs on Lambda functions, and the scanner engine hit Lambda’s 15-minute execution limit for one large enterprise customer with hundreds of thousands of AWS assets in a single AWS account, which caused the scans to time out.
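
For context, a Lambda function can at least see the hard timeout coming: the Python runtime's context object exposes `get_remaining_time_in_millis()`. The sketch below is not the unusd.cloud scanner; `scan_one`, the asset list, and the 30-second safety margin are hypothetical names and values chosen for the example.

```python
from typing import Any, Callable, Iterable, Tuple

SAFETY_MARGIN_MS = 30_000  # assumed buffer kept for cleanup and reporting


def scan_until_deadline(
    assets: Iterable[Any],
    scan_one: Callable[[Any], None],
    context: Any,
) -> Tuple[int, bool]:
    """Scan assets one by one and stop before Lambda kills the invocation.

    Returns (number of assets scanned, True if every asset was scanned).
    """
    scanned = 0
    for asset in assets:
        # Check the remaining time budget before starting each asset.
        if context.get_remaining_time_in_millis() < SAFETY_MARGIN_MS:
            return scanned, False  # out of time: report a partial scan
        scan_one(asset)
        scanned += 1
    return scanned, True
```

This only turns a timeout into a clean partial result. It does not make an account with hundreds of thousands of assets fit into 15 minutes.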

At first, I started rebuilding the engine as a decoupled Step Functions workflow with loops and concurrent jobs, but it was clearly over-engineering, overthinking, and for sure a lot of work (also a lot of fun and tears).

In the end, I’m thinking of moving the scanner feature to a more classic ECS Fargate task, since scan duration is not really a concern for this specific customer. After all, is re-architecting for 0.01% of your user base worth the effort?

Reversibility

Don’t worry about reversibility: changing CSP will never happen (trust me). A well-documented app with comprehensive and maintainable code is a better bet than spending months making your app infrastructure reversible and cloud-agnostic. Save yourself the effort.

Takeaways

Remember, you’re not building Cathedrals. Your architecture will only last a few years before being replaced by better, more up-to-date solutions that are tightly aligned with today’s product needs.

Therefore, it’s important to keep your architecture dead simple and iterate based on user feedback and evolving business needs. This dynamic, iterative approach is one of the key advantages of public cloud solutions.

The beauty of public clouds lies in their ability to seamlessly align infrastructure provisioning with business requirements and usage. (I’m not talking about the autoscaling feature.)

Keep it dead simple. 🙏🏻

That’s all folks!

zoph.