API design in AI systems

From external traffic to running inference containers!!

Hi Inner Circle!!

Welcome to this week's edition.

There's one system design concept that hasn't lost its importance across traditional systems and modern AI applications: API Design.

APIs are the backbone of production systems. They connect services, expose functionality, and help your applications communicate with each other.

And yes ~ your AI models depend on them too.

Every engineer ~ I repeat, every engineer ~ should understand the role APIs play in system design.

Because at the end of the day, users don't interact with your database, your Kubernetes cluster, or your model. They interact with your API.

APIs aren't new ~ they've been around for decades. But the rules that worked for traditional applications start to change when you put an LLM behind them.

This week we cover API design fundamentals, common patterns, and the lessons every engineer should know when building both traditional and AI-powered systems.

Let's get into it ~

Traditional API Design ~ The Foundations Still Matter

Before we talk about AI, you have to understand what every API has to get right.

Routing decides which function handles which request. A clean structure like /users/{id} or /orders/{id}/items makes APIs predictable and easy to debug. Sloppy routing creates dead endpoints and inconsistent behavior.
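
To make the routing idea concrete, here is a minimal sketch using only the Python standard library. The route table, handler names, and `match_route` helper are hypothetical; a real framework such as FastAPI does this matching for you.

```python
import re

# Hypothetical route table: path template -> handler name.
ROUTES = {
    "/users/{id}": "get_user",
    "/orders/{id}/items": "list_order_items",
}

def compile_template(template: str) -> re.Pattern:
    """Turn '/users/{id}' into a regex with one named group per placeholder."""
    pattern = re.sub(r"\{(\w+)\}", r"(?P<\1>[^/]+)", template)
    return re.compile(f"^{pattern}$")

COMPILED = [(compile_template(t), handler) for t, handler in ROUTES.items()]

def match_route(path: str):
    """Return (handler_name, path_params), or None when no route matches."""
    for regex, handler in COMPILED:
        m = regex.match(path)
        if m:
            return handler, m.groupdict()
    return None

print(match_route("/users/42"))
print(match_route("/orders/7/items"))
print(match_route("/users/42/extra"))   # no template matches this path
```

Because every template is anchored at both ends, a path either maps to exactly one handler or to none, which is what keeps routing predictable.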

Idempotency means the same request can be retried without breaking things. A POST /payment that runs twice and charges the user twice is an idempotency failure. Good APIs use idempotency keys so retries are safe.
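
Here is a small sketch of the idempotency-key pattern described above. `PaymentService` is a hypothetical in-memory stand-in; in a real system the key store would be shared across servers, written atomically, and expired after some time.

```python
import uuid

class PaymentService:
    """In-memory stand-in for a payment backend (hypothetical)."""

    def __init__(self):
        self._results = {}   # idempotency key -> stored response
        self.charges = []    # every charge that was actually executed

    def post_payment(self, idempotency_key: str, user_id: str, amount_cents: int) -> dict:
        # A retry with a known key returns the stored response
        # instead of charging the user again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        charge = {
            "charge_id": str(uuid.uuid4()),
            "user_id": user_id,
            "amount_cents": amount_cents,
        }
        self.charges.append(charge)
        self._results[idempotency_key] = charge
        return charge

service = PaymentService()
key = str(uuid.uuid4())   # the client generates one key per logical payment
first = service.post_payment(key, "user-1", 1999)
retry = service.post_payment(key, "user-1", 1999)   # e.g. after a network timeout
```

The retry returns the same charge and `service.charges` still holds a single entry, so the user is billed once.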

Statelessness means the server keeps no per-client session state between requests. Every call carries its own context, and anything that must persist lives in a shared store such as a database or cache. This is what lets you scale horizontally ~ add more servers, and any one of them can handle any request.

Error handling is where most APIs fail. Return the right status code, a clear message, and enough detail to debug ~ without leaking internals.
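
One way to apply this, as a sketch: return a stable error code and a readable message to the client, and keep the stack trace in the server log under a request ID. The `error_response` helper, the error codes, and the JSON shape are assumptions for illustration, not a standard.

```python
import json
import logging
import uuid

logger = logging.getLogger("api")

def error_response(status, code, message, exc=None):
    """Build a client-safe error body; internals go to the log, not the client."""
    request_id = str(uuid.uuid4())
    if exc is not None:
        # Full detail stays server-side, keyed by request_id.
        logger.error("request_id=%s %r", request_id, exc)
    body = {"error": {"code": code, "message": message, "request_id": request_id}}
    return status, json.dumps(body)

def get_user(user_id: str):
    users = {"42": {"id": "42", "name": "Ada"}}   # stand-in for a database
    try:
        if not user_id.isdigit():
            return error_response(400, "invalid_user_id", "user id must be numeric")
        if user_id not in users:
            return error_response(404, "user_not_found", f"no user with id {user_id}")
        return 200, json.dumps(users[user_id])
    except Exception as exc:
        return error_response(500, "internal_error", "unexpected error, quote the request_id to support", exc)
```

The client gets enough to act on (status, code, message, request ID), and the person debugging can look up the request ID in the logs.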

There are a few more, but these are the critical ones.

What Changes With AI APIs

A traditional API returns a database row in milliseconds. An AI API does something fundamentally different ~ and that difference reshapes every design choice.

Inference is slow. A model can take 2–30 seconds to respond, so stream tokens as they're generated ~ via Server-Sent Events or WebSockets ~ instead of making the client wait for the full response.
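
As a rough sketch of token streaming, the generator below wraps each token in a Server-Sent Events frame. `fake_model` is a hypothetical stand-in for a real inference backend, and the `[DONE]` marker is a common convention rather than part of the SSE format.

```python
import asyncio
import json

async def fake_model(prompt: str):
    """Stand-in for an inference backend: yields tokens one at a time."""
    for token in ["Hello", ",", " world", "!"]:
        await asyncio.sleep(0)   # a real model would spend time here
        yield token

async def sse_stream(prompt: str):
    """Emit one SSE frame per token: a 'data:' line followed by a blank line."""
    async for token in fake_model(prompt):
        payload = json.dumps({"token": token})
        yield f"data: {payload}\n\n"
    yield "data: [DONE]\n\n"   # end-of-stream marker (a convention)

async def collect(prompt: str):
    return [frame async for frame in sse_stream(prompt)]

frames = asyncio.run(collect("hi"))
for frame in frames:
    print(frame, end="")
```

In FastAPI, an async generator like `sse_stream` can be returned through `StreamingResponse` with `media_type="text/event-stream"`, so the client starts rendering after the first token.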

Idempotency works differently. Retrying an LLM call wastes GPU time, and with sampling enabled the retry can return a different answer. AI APIs need prompt caching, request deduplication, and idempotency keys tied to the prompt and parameters.
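
A minimal sketch of a key tied to the prompt and parameters: hash everything that affects the output, then reuse the stored result when the same key shows up again. `request_key`, `DedupCache`, and `fake_generate` are hypothetical names, and the cache is an in-memory dict for illustration only.

```python
import hashlib
import json

def request_key(model: str, prompt: str, params: dict) -> str:
    """Derive a deterministic key from everything that affects the output."""
    canonical = json.dumps(
        {"model": model, "prompt": prompt, "params": params},
        sort_keys=True,              # parameter order must not change the key
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

class DedupCache:
    def __init__(self, generate):
        self._generate = generate
        self._cache = {}
        self.model_calls = 0         # how often the model was really invoked

    def complete(self, model: str, prompt: str, params: dict) -> str:
        key = request_key(model, prompt, params)
        if key not in self._cache:
            self.model_calls += 1
            self._cache[key] = self._generate(model, prompt, params)
        return self._cache[key]

def fake_generate(model, prompt, params):
    """Stand-in for a real model call."""
    return f"[{model}] echo: {prompt}"

cache = DedupCache(fake_generate)
a = cache.complete("small", "What is an API?", {"temperature": 0.0, "max_tokens": 64})
b = cache.complete("small", "What is an API?", {"max_tokens": 64, "temperature": 0.0})
c = cache.complete("small", "What is an API?", {"temperature": 0.7, "max_tokens": 64})
```

The first two calls share a key, so the model runs once for them; changing the temperature produces a new key. Reusing a stored answer only makes sense where returning the same response is acceptable, such as retries of one logical request.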

Batching can be the difference between a $500/month GPU bill and a $50,000 one. A GPU serving one request at a time leaves most of its capacity idle. Continuous batching (vLLM, TGI) swaps requests in and out of the running batch at the token level, so a finished request frees its slot immediately.
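
To see why token-level scheduling matters, here is a toy simulation that counts decode steps for static versus continuous batching. It is a deliberately simplified model (no prefill cost, no memory limits, every step costs the same), and the function names are made up; vLLM and TGI implement the real scheduling inside the inference server.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Fixed batches: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """Free slots are refilled from the queue after every decode step."""
    queue = deque(lengths)
    active = []        # tokens still to generate for each in-flight request
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        active = [remaining - 1 for remaining in active]   # one decode step for all
        active = [remaining for remaining in active if remaining > 0]
        steps += 1
    return steps

# Output lengths (in tokens) for eight queued requests, two slots on the GPU.
lengths = [2, 8, 2, 8, 2, 2, 2, 2]
static_steps = static_batching_steps(lengths, batch_size=2)
continuous_steps = continuous_batching_steps(lengths, batch_size=2)
print(static_steps, continuous_steps)
```

On this toy workload the static schedule needs 20 decode steps and the continuous one needs 14, because short requests no longer wait for the long request sharing their batch.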

Rate limiting needs a rethink. Count tokens per minute ~ not requests. A 50-token prompt and a 50,000-token prompt cost wildly different amounts of GPU time.
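
A sketch of what counting tokens instead of requests can look like: a token bucket per tenant that refills at the tokens-per-minute limit. The class name and the injectable `clock` are assumptions made so the example can be run without waiting.

```python
import time

class TokenRateLimiter:
    """Per-tenant bucket that refills at `tokens_per_minute` and is drained by LLM tokens."""

    def __init__(self, tokens_per_minute: int, clock=time.monotonic):
        self.capacity = float(tokens_per_minute)
        self.refill_per_second = tokens_per_minute / 60.0
        self._clock = clock
        self._buckets = {}   # tenant -> (available_tokens, last_refill_time)

    def allow(self, tenant: str, requested_tokens: int) -> bool:
        now = self._clock()
        available, last = self._buckets.get(tenant, (self.capacity, now))
        # Refill for the elapsed time, never above capacity.
        available = min(self.capacity, available + (now - last) * self.refill_per_second)
        if requested_tokens > available:
            self._buckets[tenant] = (available, now)
            return False
        self._buckets[tenant] = (available - requested_tokens, now)
        return True

fake_now = [0.0]   # fake clock so the demo does not sleep
limiter = TokenRateLimiter(tokens_per_minute=60_000, clock=lambda: fake_now[0])
small = limiter.allow("tenant-a", 50)               # tiny prompt
large = limiter.allow("tenant-a", 50_000)           # huge prompt, still fits
second_large = limiter.allow("tenant-a", 50_000)    # budget is now exhausted
fake_now[0] = 60.0                                  # one minute later
after_refill = limiter.allow("tenant-a", 50_000)
```

In practice `requested_tokens` would be the prompt token count plus the requested maximum output, reconciled with the actual usage once the response completes.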

Model routing means not every request needs your most powerful model. Route dynamically based on task complexity, token count, or cost thresholds. Simple requests go to a smaller model. Heavy ones escalate. Lower costs, faster responses, expensive model reserved for work that actually needs it.
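
A minimal routing sketch under stated assumptions: the model names, task labels, and the 2,000-token threshold are all hypothetical, and the token estimate is a crude heuristic rather than a real tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Crude estimate (about 4 characters per token); use the model's tokenizer in practice."""
    return max(1, len(text) // 4)

# Hypothetical model names and task labels.
SMALL_MODEL = "small-model"
LARGE_MODEL = "large-model"
COMPLEX_TASKS = {"code_review", "multi_step_reasoning"}

def choose_model(task: str, prompt: str, max_prompt_tokens_for_small: int = 2000) -> str:
    """Escalate to the large model only for complex tasks or very long prompts."""
    if task in COMPLEX_TASKS:
        return LARGE_MODEL
    if estimate_tokens(prompt) > max_prompt_tokens_for_small:
        return LARGE_MODEL
    return SMALL_MODEL

print(choose_model("classification", "Is this message spam?"))
print(choose_model("code_review", "def f(): pass"))
print(choose_model("summarize", "a" * 20_000))
```

Keeping the decision in one small function makes it easy to log which rule fired and to tune thresholds against cost and quality data later.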

Cost attribution means knowing who's spending what. Aggregate bills tell you nothing. Tag every request with user ID, feature ID, and model version. Track tokens per request. Roll it up by tenant. It's not a finance problem ~ it's an infrastructure problem.
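
The tagging and roll-up can be sketched as below. The field names follow the paragraph above; the prices are invented for illustration and kept in integer micro-dollars per token to avoid floating-point drift.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical prices in micro-USD per token, keyed by model version.
PRICE_MICRO_USD_PER_TOKEN = {"small-v1": 1, "large-v2": 10}

@dataclass
class UsageRecord:
    tenant_id: str
    user_id: str
    feature_id: str
    model_version: str
    prompt_tokens: int
    completion_tokens: int

def cost_micro_usd(record: UsageRecord) -> int:
    total = record.prompt_tokens + record.completion_tokens
    return total * PRICE_MICRO_USD_PER_TOKEN[record.model_version]

def rollup(records, key: str) -> dict:
    """Sum tokens and cost per value of `key` (e.g. 'tenant_id' or 'feature_id')."""
    totals = defaultdict(lambda: {"tokens": 0, "cost_micro_usd": 0})
    for r in records:
        bucket = totals[getattr(r, key)]
        bucket["tokens"] += r.prompt_tokens + r.completion_tokens
        bucket["cost_micro_usd"] += cost_micro_usd(r)
    return dict(totals)

records = [
    UsageRecord("acme", "u1", "chat", "small-v1", 100, 50),
    UsageRecord("acme", "u2", "summarize", "large-v2", 1000, 200),
    UsageRecord("globex", "u3", "chat", "small-v1", 40, 10),
]
by_tenant = rollup(records, "tenant_id")
by_feature = rollup(records, "feature_id")
```

Because every record carries the same tags, the same data answers "which tenant?" and "which feature?" without a second pipeline.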

Safety means validating output, not just input. Before a response reaches the client, run it through PII detection, content filtering, and guardrails. Tools like Guardrails AI and Llama Guard plug into this layer. Build it in from day one.
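
To show where an output check sits, here is a deliberately simple sketch using regular expressions. It is not the Guardrails AI or Llama Guard API; the patterns, the blocked term, and `check_output` are illustrative assumptions and far less thorough than a dedicated tool.

```python
import re

# Illustrative patterns only; real PII detection needs much more than this.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")
BLOCKED_TERMS = {"internal-api-key"}   # hypothetical forbidden term

def check_output(text: str) -> dict:
    """Run on the model's response before it is sent to the client."""
    lowered = text.lower()
    if any(term in lowered for term in BLOCKED_TERMS):
        return {"allowed": False, "text": "", "reason": "blocked_term"}
    redacted = EMAIL.sub("[REDACTED_EMAIL]", text)
    redacted = PHONE.sub("[REDACTED_PHONE]", redacted)
    return {"allowed": True, "text": redacted, "reason": None}

safe = check_output("Contact ada@example.com or 555-123-4567 for details")
blocked = check_output("The INTERNAL-API-KEY is abc123")
print(safe["text"])
print(blocked["allowed"])
```

With streaming responses the check has to run on buffered chunks rather than the full text, which is one reason to design this layer early.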

Observability shifts. Measure time-to-first-token, tokens per second, GPU utilization, queue depth, and cost per request.
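
Two of these metrics can be measured directly around the token stream, as in this sketch. `measure_stream` and its injectable `clock` are hypothetical; GPU utilization and queue depth would come from the inference server's own metrics rather than from this wrapper.

```python
import time

def measure_stream(token_iter, clock=time.perf_counter):
    """Consume a token stream and report time-to-first-token and decode throughput."""
    start = clock()
    first_token_at = None
    count = 0
    for _ in token_iter:
        if first_token_at is None:
            first_token_at = clock()
        count += 1
    end = clock()
    if first_token_at is None:            # the stream produced nothing
        return {"time_to_first_token_s": None, "tokens": 0, "tokens_per_second": None}
    decode_time = end - first_token_at
    # Throughput covers the tokens generated after the first one.
    tps = (count - 1) / decode_time if count > 1 and decode_time > 0 else None
    return {
        "time_to_first_token_s": first_token_at - start,
        "tokens": count,
        "tokens_per_second": tps,
    }

# Fake clock for a repeatable demo: start, first token, end of stream.
ticks = iter([0.0, 0.5, 2.5])
metrics = measure_stream(iter(["a", "b", "c", "d", "e"]), clock=lambda: next(ticks))
print(metrics)
```

Time-to-first-token reflects queueing and prefill, while tokens per second reflects decode speed, so tracking them separately points to where the latency is coming from.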

This is where FastAPI quietly became the default ~ async by design, streaming responses built in, automatic OpenAPI docs, and the same language as the model code.

3 Mistakes Engineers Make When Wrapping an LLM in an API

~ Treating it like a traditional request/response API and blocking on the full response instead of streaming

~ Rate limiting by request count instead of token count

~ Skipping batching and wondering why GPU costs are out of control

The Infrastructure Layer ~ Where APIs Actually Live

Your FastAPI service sits behind an API gateway that handles auth, rate limiting, routing, and logging before requests ever hit your model.

In traditional cloud systems, that gateway is usually AWS API Gateway, Apigee, Kong, or another managed cloud gateway.

In Kubernetes, the Gateway API separates concerns cleanly between cluster operators and app developers, with traffic splitting and richer protocol support.

For AI workloads, the Gateway API Inference Extension (often called the Kubernetes Inference Gateway) adds model-aware routing based on model name, load, or latency targets.

A well-designed AI API isn't just a FastAPI route. It's a routing layer, an inference layer, a batching layer, and an observability layer ~ all working together.

Where to Focus

If you're building AI APIs, focus in this order:

~ get the traditional fundamentals right: routing, idempotency, error handling

~ pick a framework built for async and streaming (FastAPI is the safe default)

~ add dynamic or continuous batching with vLLM or TGI as your inference backend

~ put it behind a gateway: API Gateway, Apigee, or Kubernetes Gateway API

~ measure tokens, not just requests

The model is the easy part. The API is what makes it production-grade.

That’s it for today!!

If there’s a specific AI system design topic you’d like me to cover next, just reply back and let me know.

Until then, keep building, keep learning, and keep shipping.

See you in the next one.

-V