So you want to architect some infrastructure?
As part of the day-to-day job of a software engineer, we’re tasked with a lot of things: understanding the problem and its constraints, finding out what tools are at our disposal, coming up with a coherent solution using all of these, and then finally writing the code. However, once the code is written, it needs a place to run, to store data, and to run specialized operations such as rendering a PDF or resizing images. Getting that stack right involves thinking in depth about the infrastructure design and how the various components will interact with each other.
So how do you start approaching the problem of “backend infrastructure” design? Over the years, I’ve been making a list of things I like to consider while thinking through a backend design problem. Let’s take you through this journey…
User/Persona
To begin with, always know who your users are. Is there a persona you can conjure up for your user? Sometimes it’s interesting to think of machines as users too, because at the end of the day there is a human writing the code that runs on the machine or integrates with your APIs (unless Skynet is good enough for a 9-5 coding job).
Persona design questions:
- What does success mean to your user?
Let’s start with the most important question: if we are not designing things that are useful to our users, then the rest of the work doesn’t make any sense. What your users consider success should be crystal clear. If it isn’t, go have a chat with people on your team, the product managers and designers, and iterate on the problem definition until everyone agrees on the same definition of what success means for a user.
For example, if you are designing a system to send SMSes, you can talk about queueing systems, low-latency services and high availability all day long, but the entire fancy system running on 7 lambdas has no value unless users are consistently seeing the SMSes: not just that they were delivered to the device, but that a user actually consumed that information and found it useful.
Sometimes, thinking about what success means from the POV of an operator is also interesting. It’s a good way to compare and contrast what you, as the operator or the person designing/writing/maintaining the software, think should be considered success vs. what a real user thinks is success. This is where user interviews can be super valuable.
- How will they access your system?
For human users, this could mean describing whether they will interact with your system via a mobile app, a web app, email/SMS/push notifications etc. It could even be a Point-of-Sale widget at the coffee shop. Knowing the means of interaction and its limitations is important. Think outside the box here: if people are interacting with your system in a physical space, you could leverage Bluetooth, captive Wi-Fi portals or even ultrasonic data transfer to interact with people’s devices.
In the case of a "machine-user", this means describing which protocols are best suited to your use cases. Keep things simple, but know that while HTTP and TCP are popular, they aren’t the only options.
- Cost
Do you have a budget for your design? Now that you have an idea of what success means, it can be further qualified: what does success mean at the given cost? If the cost is unacceptable, maybe it’s time to rethink what success should mean and change the requirements. After all, your business should have a clear path to delivering value to your users at a cost that doesn’t bankrupt the company.
- Maintenance & Complexity
Don’t forget to consider the continued success of your system after it is launched. It’s not going to run in a vacuum: bit-rot happens and security vulnerabilities pop up. Have a good story for how the system will live its life. This is also a good point to chat with your team about tackling tech debt upfront. The more hands-off you and your team can be from having to babysit your systems, the more productive everyone will be. You can go work on new things without getting caught in the perpetual patch-break-patch loop. I’d like to encourage you to take pride in designing/writing software you can forget about: things that keep running with minimal intervention.
Boundary Conditions
- Scale, Throughput, Latency, DataSize
It’s really important to know when to stop. What scale are you solving for? Good load tests are key here. Some really useful tools for figuring out what scale your system can handle are ApacheBench (ab), curl and JMeter. Writing custom load testing code is often a good investment if your project is more complicated. Make sure to add monitoring so that you can compare graphs of how the system behaves under load instead of just how fast the load test completes. This is also very useful for improving the performance of your system if and when needed.
Knowing the scale, throughput, and latency requirements is a great exercise in tradeoffs. We can scale infinitely if we have infinite servers, but infinite servers will cost infinite money. While thinking through all these, remember not to base your assumptions on old data. This is a very common mistake I’ve seen time and again. Your mental heuristics about how many rows a MySQL instance can write per second could be (probably are) many years old. Make sure to question all assumptions.
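To make the "custom load testing code" idea a bit more concrete, here is a minimal sketch in Python using only the standard library. It is not a replacement for the tools above. The names `run_load_test` and `fake_request` are made up for illustration, and `fake_request` is a stand-in you would swap for a real call to your own system. The point is that it reports latency percentiles and error counts rather than just total run time.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def run_load_test(request_fn, total_requests=200, concurrency=20):
    """Call request_fn total_requests times from a thread pool and
    return throughput, error count and latency percentiles (ms)."""

    def timed_call(_):
        start = time.perf_counter()
        try:
            request_fn()
            ok = True
        except Exception:
            ok = False  # count failures instead of aborting the run
        return (time.perf_counter() - start) * 1000.0, ok

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_call, range(total_requests)))
    wall_seconds = time.perf_counter() - wall_start

    latencies = [ms for ms, _ in results]
    errors = sum(1 for _, ok in results if not ok)
    cuts = statistics.quantiles(latencies, n=100)  # 99 percentile cut points
    return {
        "requests": total_requests,
        "errors": errors,
        "throughput_rps": total_requests / wall_seconds,
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
    }


def fake_request():
    # Hypothetical stand-in for a real request to your system.
    time.sleep(0.005)


if __name__ == "__main__":
    print(run_load_test(fake_request))
```

Running the same script before and after a change, with the same `total_requests` and `concurrency`, gives you numbers you can line up against your monitoring graphs.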
- Data TTL
It’s refreshing to see various governments coming up with data privacy laws. How does this translate to your system? There are generally limits to how long you can store data in your systems, so data TTL (time-to-live) is another interesting lens to view your system through. Retrofitting something like this can become very costly, since data has a tendency to spread tendrils all over your system.
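One way to think about data TTL early is to write the retention policy down as data and make every record carry a category and a creation time. The sketch below is only an illustration: the categories and durations in `RETENTION` are invented, not legal guidance, and a real system would usually run this kind of purge inside the datastore rather than in application memory.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical retention policy: data category -> maximum age.
RETENTION = {
    "sms_body": timedelta(days=30),
    "delivery_log": timedelta(days=365),
}


@dataclass
class Record:
    record_id: str
    category: str
    created_at: datetime


def is_expired(record, now):
    """True when the record has outlived the TTL for its category."""
    ttl = RETENTION.get(record.category)
    if ttl is None:
        # Force a retention decision for every new kind of data.
        raise ValueError(f"no retention policy for {record.category!r}")
    return now - record.created_at >= ttl


def purge(records, now=None):
    """Split records into (kept, purged) according to RETENTION."""
    now = now or datetime.now(timezone.utc)
    kept, purged = [], []
    for record in records:
        (purged if is_expired(record, now) else kept).append(record)
    return kept, purged
```

Raising on an unknown category is a design choice in this sketch: it means a new data type cannot quietly ship without someone deciding how long it is allowed to live.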
- Security+Privacy concerns
Similar to the last point, think of security from the get-go. The security and privacy landscape is becoming increasingly complex every day. Falling behind here can leave you open to make-or-break security incidents. Some of the common mistakes I’ve seen here are leaving instances exposed to the public internet, doing the same with database endpoints, and leaving more ports open than needed, especially when critical services like SSH could be running on them.
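Checks for the mistakes above are easy to automate. The sketch below assumes a made-up rule format (a dict with `name`, `source`, `from_port` and `to_port`), so you would need to adapt it to whatever your cloud provider or firewall actually exports. The port numbers are the well-known defaults for each service.

```python
import ipaddress

# Default ports for services that should rarely face the whole internet.
SENSITIVE_PORTS = {22: "SSH", 3306: "MySQL", 5432: "PostgreSQL", 6379: "Redis"}


def is_world_open(cidr):
    """True for 0.0.0.0/0 or ::/0, i.e. any source address."""
    return ipaddress.ip_network(cidr, strict=False).prefixlen == 0


def audit(rules):
    """Return a finding for each sensitive port reachable from anywhere.

    Each rule is a dict in a hypothetical format:
    {"name": str, "source": CIDR str, "from_port": int, "to_port": int}
    """
    findings = []
    for rule in rules:
        if not is_world_open(rule["source"]):
            continue
        for port, service in sorted(SENSITIVE_PORTS.items()):
            if rule["from_port"] <= port <= rule["to_port"]:
                findings.append(
                    f"{rule['name']}: {service} (port {port}) "
                    f"open to {rule['source']}"
                )
    return findings


if __name__ == "__main__":
    example_rules = [
        {"name": "web", "source": "0.0.0.0/0", "from_port": 443, "to_port": 443},
        {"name": "bastion", "source": "0.0.0.0/0", "from_port": 22, "to_port": 22},
    ]
    for finding in audit(example_rules):
        print(finding)
```

Something like this can run in CI against your infrastructure definitions, so an overly broad rule gets flagged before it is deployed.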
Concurrency
- Which components are async?
Race conditions aren’t limited to code; they can also occur at the level of the infrastructure itself, for example when two independent processes update the same piece of shared state at the same time.
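As an illustration of an infrastructure-level race, imagine two automation jobs that both read a shared "desired instance count", add to it, and write it back: one update can silently overwrite the other. One common guard is a versioned compare-and-set, sketched below. `ConfigStore` is a hypothetical in-memory stand-in, and whether your real store offers conditional writes is something you would need to check.

```python
import threading


class ConflictError(Exception):
    """Raised when the stored version changed since it was read."""


class ConfigStore:
    """In-memory stand-in for a shared store with compare-and-set."""

    def __init__(self, value):
        self._value = value
        self._version = 0
        self._lock = threading.Lock()  # models the store's own atomicity

    def read(self):
        with self._lock:
            return self._value, self._version

    def write_if_version(self, new_value, expected_version):
        with self._lock:
            if self._version != expected_version:
                raise ConflictError("value changed since it was read")
            self._value = new_value
            self._version += 1


def add_instances(store, delta, max_attempts=50):
    """Read-modify-write with retries, so concurrent updates are not lost."""
    for _ in range(max_attempts):
        value, version = store.read()
        try:
            store.write_if_version(value + delta, version)
            return value + delta
        except ConflictError:
            continue  # someone else won the race; re-read and try again
    raise RuntimeError("gave up after repeated conflicts")
```

The writer that loses the race gets an explicit conflict and retries against fresh data, instead of overwriting a change it never saw.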
- Are you looking for an MVP or a future-proof solution?
- Challenge everything
- What components are needed
- What are their capabilities
- Mention what you’re generally going to skip
- Privacy
- Operations
- Mention:
- Replication
- Scaling
- Data access patterns