7 System Design Patterns Every Cloud AI Engineer Should Know

Tools are easy to learn. Patterns are what keep your system alive.

Hi Inner Circle!

Welcome to this week's edition.

Before you dive into tools ~ Kubernetes, LangChain, vector databases ~ you need patterns.

These 7 are the ones that show up in every production AI system, every architecture review, and honestly ~ every solutions architect (SA) interview too.

Whether you're upskilling into cloud AI or already building, these are worth knowing..

Let's get into it ~

Before we begin… a quick thank you to today's partner, Dex.

Meet Dex ~ the world's first Autonomous IT Engineer.

Following its successful launch on March 31, the Dex team is now opening access to Put Dex to Work.. a hands-on program built for real environments, not demos.

Dex handles complex IT tasks across Microsoft 365 and Google Workspace, right inside Teams or Slack, with the reasoning depth of a skilled engineer. Up to 90% of issues resolved before they ever become a ticket.

The program includes 3 months of unlimited usage + a dedicated onboarding engineer.

1. API Communication

Services in a distributed system need a way to talk to each other. An API defines that contract — one service says: send me a request in this format, and I'll send back a response in this format.

Two modes.

Sync ~ one service calls another and waits for a response. Used when you need an answer immediately. REST and gRPC are the common implementations.

Async ~ one service sends a request and moves on. The other processes it when ready. Used for jobs that don't need an instant reply — document processing, batch jobs, notification pipelines.

One concept worth knowing: idempotency. A request times out and you retry. You should get one result, not two. Applies to anything that writes data — orders, payments, inference calls.

Why it matters ~ Without a clear communication contract, services break each other silently. Wrong mode, no retry logic, no idempotency — and you get duplicate records, lost requests, and failures that are hard to trace back to the source.

Scenario ~ A user clicks "place order." The request hits your order service, which calls the payment service. The charge goes through but the response never makes it back. The client retries. The user gets charged twice. What should have been in place?

2. Load Balancing

A load balancer distributes incoming traffic across multiple servers so no single one gets overwhelmed. If one server is busy, the next request goes elsewhere. If a server goes down, traffic routes around it automatically.

Not all requests take the same time. A short request finishes in milliseconds. A heavy one takes several seconds.

Round-robin ~ sending requests one by one to each server in turn ~ ignores that.

Least-connections is smarter. It sends the next request to whichever server is least busy right now.
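The selection logic is small enough to sketch in a few lines ~ real load balancers track in-flight connections at the proxy layer, but the idea is exactly this:

```python
class LeastConnectionsBalancer:
    """Route each request to the server with the fewest in-flight requests."""

    def __init__(self, servers: list[str]):
        self.active = {s: 0 for s in servers}  # server -> open connections

    def acquire(self) -> str:
        # Pick the least-busy server and count the new connection against it.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server: str) -> None:
        # Request finished -- free up the slot.
        self.active[server] -= 1

lb = LeastConnectionsBalancer(["a", "b", "c"])
s1 = lb.acquire()   # all idle -> any server
s2 = lb.acquire()   # goes to a different idle server, not the same one
lb.release(s1)      # s1 finished; it's back among the least busy
```

Notice that a slow request naturally "holds" its server in the busy count, so new traffic flows around it ~ exactly what round-robin fails to do.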

Why it matters ~ Without load balancing, one server takes all the heat while others sit idle. One slow server degrades the experience for every user hitting it — not just during peak traffic, but any time request volume is uneven.

Scenario ~ Your API runs fine at 9am. By noon traffic triples. One server slows down but stays up. Users on that server are getting 10x slower responses than everyone else. What's missing in your setup?

3. Message Queues

A queue is a buffer that sits between the service sending work and the service doing the work. The sender drops a job into the queue and moves on. The worker picks it up when it's ready. If the worker crashes mid-job, the item goes back on the queue. Nothing is lost.

Without a queue, your API waits for every job to finish before responding. Under heavy traffic, everything backs up and requests start failing.

With a queue ~ your API accepts the request, adds it to the list, and responds immediately. The heavy lifting happens in the background.

SQS, Kafka, and RabbitMQ are the common implementations.
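Here's the sender/worker shape in miniature, using Python's standard `queue` module as an in-process stand-in ~ SQS, Kafka, and RabbitMQ add persistence and redelivery on top of the same idea:

```python
import queue
import threading

jobs: "queue.Queue[str]" = queue.Queue()
done: list[str] = []

def worker() -> None:
    while True:
        doc = jobs.get()          # blocks until a job is available
        if doc is None:           # sentinel -> shut down
            break
        done.append(f"processed:{doc}")
        jobs.task_done()          # ack: safe to forget this job

# API side: accept every upload instantly by enqueueing it
for i in range(5):
    jobs.put(f"doc-{i}")

t = threading.Thread(target=worker)
t.start()
jobs.join()                       # wait until every job is acked
jobs.put(None)                    # tell the worker to stop
t.join()
```

The `task_done` ack is the part that maps to "nothing is lost" ~ a real broker only deletes a message once the worker confirms it, so a crash mid-job means redelivery, not data loss.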

Why it matters ~ Not everything needs a real-time response. Sending emails, processing uploads, generating reports, running batch inference — these are all background jobs. Queues let you handle them reliably without blocking your main application or losing work when something fails.

Scenario ~ Your app lets users upload documents for processing. At 2pm, 500 users upload at the same time. Your processing service can handle 20 at once. Without a queue, what happens to the other 480 requests?

4. Model Routing

Not every request needs your most powerful, most expensive model. Model routing is a layer that looks at each incoming request and decides which model handles it.

Simple, repetitive, low-stakes queries go to a smaller, faster, cheaper model.

Complex reasoning and multi-step tasks go to the larger one.

Routing logic can be rules-based ~ route by prompt length, request type, or endpoint. Or classifier-based ~ a lightweight model scores the request and decides which backend handles it.
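A rules-based router can be this small ~ the model names, keyword list, and 500-character threshold below are invented for the sketch; you'd tune them against your own traffic:

```python
def route(prompt: str) -> str:
    """Rules-based routing sketch: length and keywords stand in for complexity."""
    multi_step = any(
        w in prompt.lower() for w in ("step by step", "plan", "compare")
    )
    if len(prompt) > 500 or multi_step:
        return "large-model"   # complex reasoning -> big, expensive model
    return "small-model"       # short, simple query -> fast, cheap model
```

The classifier-based version has the same shape ~ you just replace the `if` with a lightweight model's score. Rules are a fine place to start because they're cheap, debuggable, and usually catch the bulk of the easy traffic.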

Why it matters ~ At scale, sending every request to your largest model burns through your inference budget fast. Most production traffic is simple. Routing it to a smaller model cuts costs significantly without any drop in quality — and keeps your larger models free for work that actually needs them.

Scenario ~ Your AI assistant handles 10,000 requests a day. 70% are simple one-line questions. 30% are complex multi-step tasks. You're routing everything to your largest model. What's the cost and latency impact — and how would routing change it?

5. RAG Pipeline Architecture

Your model was trained on data up to a cutoff date. It knows nothing about your internal documents, your product catalog, or anything recent. RAG fixes this without retraining the model.

It works in two phases.

Ingestion ~ your documents get split into chunks, each chunk gets converted into a vector, and those vectors get stored in a vector database.

Query ~ when a user asks a question, that question also gets converted into a vector, the database finds the most similar chunks, and those chunks get passed to the model as context. The model answers using your data.

The part that breaks most RAG systems quietly ~ chunking. Too large and you're feeding the model noise. Too small and you lose the context it needs to answer well.
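A minimal fixed-size chunker with overlap looks like this ~ the sizes are tiny here so the overlap is visible; production chunks are usually a few hundred tokens, and smarter strategies split on sentence or section boundaries:

```python
def chunk(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Fixed-size character chunking with overlap between neighbors."""
    step = size - overlap
    # Each chunk starts `step` characters after the previous one,
    # so consecutive chunks share `overlap` characters of context.
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

The overlap is what keeps a sentence that straddles a boundary from being cut in half in every chunk ~ one of the two copies will contain it whole.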

Why it matters ~ Almost every enterprise AI application — internal chatbots, document Q&A, knowledge bases — is built on RAG. Understanding this pipeline means you can build it, debug it, and improve retrieval quality when it drops.

Scenario ~ You build a RAG system for a company's internal HR policy docs. Users keep getting incomplete or irrelevant answers even though the documents have the right information. The model and vector DB are working fine. Where do you look first?

6. AI Observability

Observability is your ability to see what's happening inside your system in real time. For standard applications, CPU and memory are enough. For AI systems, those metrics often tell you nothing useful. A GPU node can be at 20% CPU and completely maxed out at the model level.

The metrics that actually matter ~

TTFT — Time to First Token. How long before the user sees any response. Eight seconds of silence feels broken even if the full response arrives quickly.

Token throughput. How many tokens per second your system generates. This is your capacity number.

P99 latency. Your worst-case response time. Averages hide the users having a terrible experience. If 1 in 100 users waits 45 seconds, your average looks fine but your product is broken for those users.

Queue depth. For async pipelines — are jobs piling up faster than workers can process them?
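The p99 point is worth seeing in numbers ~ here's a quick sketch using a nearest-rank percentile, with made-up latency samples where 2 in 100 requests take 45 seconds:

```python
import math

def percentile(latencies_ms: list[float], p: float) -> float:
    """Nearest-rank percentile: the value p% of requests come in under."""
    ordered = sorted(latencies_ms)
    idx = max(math.ceil(p / 100 * len(ordered)) - 1, 0)
    return ordered[idx]

# 98 fast requests and 2 terrible ones
samples = [100.0] * 98 + [45_000.0] * 2

avg = sum(samples) / len(samples)   # under a second -- looks "healthy"
worst = percentile(samples, 99)     # 45 seconds -- the real story
```

The average comes out at 998 ms while p99 is 45,000 ms ~ same data, completely different picture of the user experience.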

Why it matters ~ You can't fix what you can't see. Standard monitoring tells you if your servers are running. AI observability tells you if your system is actually working — and where it's breaking down for real users.

Scenario ~ Users are complaining the AI assistant feels slow. Your dashboard shows CPU at 30%, memory normal, no errors. Everything looks healthy. What metrics should you actually be looking at — and what might they reveal?

7. Vector Database Design

A vector database stores and retrieves information based on meaning, not exact words. "How do I reset my password?" and "I forgot my login" return the same results even though the words are different.

Under the hood, text gets converted into vectors — numbers that represent meaning. The database uses an index to find the closest matches quickly without scanning every entry.

Three things that matter in practice ~

Metadata filtering ~ scope the search before it starts. Only search documents from this customer or this date range. Without it, every query searches everything — slower and noisier.

Namespace isolation ~ keeps different customers' data separate inside the same database.

Embedding consistency ~ use one embedding model to store data and a different one at query time, and your results will be silently wrong. Same model at both ends, always.

Why it matters ~ The vector database is the memory layer of your AI system. Poor design here breaks retrieval quality even when the model itself is working perfectly. Most RAG quality problems trace back to this layer.

Scenario ~ You build a RAG system for two enterprise clients on the same infrastructure. Client A starts getting answers that reference Client B's internal documents. What went wrong ~ and what should have been in place from the start?

That’s it for this week.

Learn the pattern. Understand the tradeoff.

That's what sticks ~ in interviews and in production.

See you in the next one.

-V