Over Architecting on Public Cloud

Following a post from my friend Julien Delange (Tech Ramblings) on software over-engineering, I want to share my thoughts about over-architecting in my preferred field of Public Cloud Architecture.

Background

I have been doing Cloud Architecture for more than ten years and have seen many different scenarios and use cases, from startups to GAFAM, across multiple industry verticals ranging from TV audience measurement to gambling and energy producers. I also frequently challenge my fellow architects’ decisions.

Over Architect

A common practice when approaching public cloud architecture for the first time is to go big from day one: reaching for trendy, complex patterns such as event-driven design or container orchestration (who said Kubernetes?), or simply redoing what you have been doing in the data center for decades.

A common mistake is to start with a highly resilient system capable of handling millions of users, even for an internal-only app without any internet exposure.

Okay, it’s cool, it makes a nice-looking diagram, and it’s easy to deploy on Public Cloud, but folks, you are not Netflix or Airbnb.

Start simple: a single instance, an API Gateway in front of a single serverless function, a container with persistence in a managed NoSQL table, or a CSP-managed SQL database.
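To give an idea of how small the "API Gateway plus a single serverless function" option can be, here is a minimal sketch of a Python Lambda handler. It assumes an API Gateway proxy integration (which passes query parameters in `queryStringParameters`). The `name` parameter and the greeting payload are made up for illustration.

```python
import json


def handler(event, context):
    """Minimal AWS Lambda handler, assuming an API Gateway proxy integration."""
    # With the proxy integration, queryStringParameters is None when the
    # request carries no query string, hence the `or {}` fallback.
    params = event.get("queryStringParameters") or {}
    name = params.get("name", "world")  # hypothetical parameter
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }
```

One function, no cluster and no queue: that is often enough for a first version.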

Builders of Cathedrals

Working backward from business need

Always ask about the business needs first, especially the Recovery Time Objective (RTO) and Recovery Point Objective (RPO), but constantly challenge them with the cost of the associated target Cloud Infrastructure. Often, the business will reconsider the initial RTO / RPO once the cost of such an architecture comes into the discussion.
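One way to bring cost into that conversation is to put a price tag next to each RTO / RPO tier. The sketch below picks the cheapest disaster recovery strategy that still meets a requested RTO / RPO. The strategy names follow the usual DR vocabulary, but every number in the table (hours and monthly cost) is a made-up placeholder: replace them with figures estimated for your own workload.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    rto_hours: float      # time needed to recover the service
    rpo_hours: float      # maximum window of data loss
    monthly_cost: float   # placeholder value, not a real price


# Hypothetical figures for illustration only.
STRATEGIES = [
    Strategy("backup-and-restore", 24, 24, 100),
    Strategy("pilot-light", 4, 1, 500),
    Strategy("warm-standby", 1, 0.25, 2000),
    Strategy("multi-site-active-active", 0.1, 0.0, 8000),
]


def cheapest_strategy(rto_hours: float, rpo_hours: float) -> Optional[Strategy]:
    """Return the cheapest strategy meeting both objectives, or None."""
    candidates = [
        s for s in STRATEGIES
        if s.rto_hours <= rto_hours and s.rpo_hours <= rpo_hours
    ]
    return min(candidates, key=lambda s: s.monthly_cost, default=None)


if __name__ == "__main__":
    for rto, rpo in [(0.5, 0.1), (8, 4), (48, 48)]:
        choice = cheapest_strategy(rto, rpo)
        print(f"RTO {rto}h / RPO {rpo}h -> {choice.name} (~{choice.monthly_cost}/month)")
```

Showing the jump in monthly cost between two tiers is usually what triggers the "maybe four hours is fine after all" discussion.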

Build in Public

While building my SaaS product (unusd.cloud), I’m facing a scaling problem after 2 years and more than 450 users, with hundreds of AWS accounts on the platform. The platform mostly runs on Lambda functions, and the scanner engine has hit the 15-minute Lambda timeout for one specific large enterprise customer with hundreds of thousands of AWS assets in a single AWS account.

At the beginning, I started to rebuild the engine as decoupled Step Functions with loops and concurrent jobs, but it was clearly over-engineering, overthinking, and for sure a lot of work (also a lot of fun and tears).

Finally, I’m thinking of moving the scanner feature to a more classic ECS Fargate task, as the time spent scanning an AWS account is not really a concern for this specific customer. But after all, is re-architecting for 0.01% of your user base worth the effort?
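To illustrate the 15-minute limit itself, here is a minimal sketch of a time-budget guard inside a Lambda function. This is not how unusd.cloud works; it is only an assumption of what a cheap stopgap could look like. It relies on `context.get_remaining_time_in_millis()`, which the Lambda Python runtime provides on the context object. `scan_one`, the asset list, and the 30-second safety margin are hypothetical.

```python
from typing import Any, Callable, List, Optional, Sequence, Tuple


def scan_with_time_budget(
    assets: Sequence[Any],
    context: Any,
    scan_one: Callable[[Any], Any],
    safety_margin_ms: int = 30_000,
) -> Tuple[List[Any], Optional[int]]:
    """Scan assets until the Lambda time budget is nearly exhausted.

    Returns (findings, next_index). next_index is None when every asset
    was scanned; otherwise it is the index to resume from in a later run.
    """
    findings: List[Any] = []
    for index, asset in enumerate(assets):
        # Stop cleanly before the hard timeout instead of being killed mid-scan.
        if context.get_remaining_time_in_millis() < safety_margin_ms:
            return findings, index
        findings.append(scan_one(asset))
    return findings, None
```

The caller would then have to persist `next_index` and re-invoke the function, which is exactly the kind of extra machinery to weigh against simply running the scan as a longer-lived task.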

Reversibility

You shouldn’t care about reversibility, as changing CSP will never happen (trust me). A well-documented app with comprehensive and maintainable code is a better bet than spending months ensuring your app infrastructure is reversible and cloud-agnostic. Save your efforts on this.

Takeaways

Remember, you’re not building Cathedrals. Your architecture will only last a few years before being replaced by better, more up-to-date solutions that are tightly aligned with today’s product needs.

Therefore, it’s important to keep your architecture dead simple and iterate based on user feedback and evolving business needs. This dynamic, iterative approach is one of the key advantages of public cloud solutions.

The beauty of public clouds lies in their ability to seamlessly align infrastructure provisioning with business requirements and usage. (And I’m not talking about the autoscaling feature.)

Keep it dead simple. 🙏🏻

That’s all folks!

zoph.
