3 Skills That Are Working Insanely Well in Cloud/DevOps

|| SPECIAL EDITION ||

Vishakha Sadhwani
August 03, 2025

Hi Inner Circle,

Today, I want to spotlight 3 powerful skills (or rather, layers of the infrastructure stack) that are actually moving the needle in Cloud, DevOps, and SRE roles right now.

Whether you're prepping for interviews, trying to stand out at work, or planning your next big career jump — these skills are showing up again and again in job listings, real-world incident reports, and engineering case studies.

Let’s dive in.

1. Terraform — But Not Just Basics Anymore

Infrastructure as Code is foundational — but in 2025, Terraform is no longer about “just knowing HCL syntax” or spinning up a compute instance.

Here’s what interviewers are really looking for now:

Can you modularize reusable infrastructure that supports multi-environment setups (dev/staging/prod) without duplicating code?
Are you using workspaces, remote state with locking, and secure backends like S3 + DynamoDB or Terraform Cloud?
Can you scale your codebase to manage GPU-backed workloads, autoscaling clusters for ML training jobs, and permissions for AI pipelines?
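
To make the first two questions concrete, here is a minimal sketch of one shared module reused across dev/staging/prod, each with its own remote state key and a lock table. Teams normally write this directly in HCL; I'm using Python only to emit Terraform's JSON syntax (`.tf.json` files). The bucket, lock table, module path, and settings are all hypothetical placeholders, not a recommended layout.

```python
import json

# Hypothetical per-environment settings: the only thing that differs
# between environments is data, never duplicated resource code.
ENVIRONMENTS = {
    "dev": {"cidr_block": "10.0.0.0/16", "node_count": 1},
    "staging": {"cidr_block": "10.1.0.0/16", "node_count": 2},
    "prod": {"cidr_block": "10.2.0.0/16", "node_count": 6},
}


def render_environment(env: str, settings: dict) -> dict:
    """Build a Terraform JSON-syntax root config for one environment."""
    return {
        "terraform": {
            "backend": {
                "s3": {
                    # Assumed names; replace with your own bucket and table.
                    "bucket": "example-tf-state",
                    "key": f"{env}/terraform.tfstate",  # one state file per env
                    "region": "us-east-1",
                    "dynamodb_table": "example-tf-locks",  # state locking
                    "encrypt": True,
                }
            }
        },
        "module": {
            "platform": {
                # Every environment points at the same (hypothetical) module.
                "source": "../../modules/platform",
                "environment": env,
                **settings,
            }
        },
    }


if __name__ == "__main__":
    for env, settings in ENVIRONMENTS.items():
        print(f"# environments/{env}/main.tf.json")
        print(json.dumps(render_environment(env, settings), indent=2))
```

The point to notice is that the three environments share one module source and differ only in their inputs and state key. Recent Terraform releases also offer S3-native lock files as an alternative to the DynamoDB table, so check what your version supports.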

With AI models needing specialized infrastructure (think: Kubernetes + Karpenter, spot instances, and high-throughput storage), Terraform becomes the central layer to codify and automate all of it.

And interviewers want to see how you design at scale, not just deploy simple resources.

Where to start if you're new:

Guide: Terraform Modules: Best Practices from HashiCorp
Video Tutorial: Terraform on AWS (FreeCodeCamp) — great for building solid foundations

2. Kubernetes — Extending It Like a Platform Engineer

Knowing how to deploy pods is no longer enough. With AI workloads scaling rapidly, Kubernetes is evolving into a platform layer — and engineers are expected to treat it like an operating system you can extend.

Interviewers now expect you to understand:

How to use and write Custom Resource Definitions (CRDs) to model things like InferenceJob, BatchTrainer, or DataPipeline
What role controllers and reconciliation loops play in managing dynamic, event-driven systems
How to enforce guardrails using Admission Controllers, or policy engines like OPA/Gatekeeper and Kyverno
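
If reconciliation loops feel abstract, the core idea fits in a few lines: compare desired state with observed state and return whatever actions close the gap. The sketch below is plain Python with no Kubernetes client involved; `InferenceJobSpec`, the worker names, and the action strings are all made up for illustration. A real controller would watch the API server and call it instead of returning strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceJobSpec:
    """Desired state, as a user might declare it in a custom resource."""
    name: str
    replicas: int


def reconcile(spec: InferenceJobSpec, running_workers: list[str]) -> list[str]:
    """Compare desired and observed state; return the actions needed to converge."""
    actions: list[str] = []
    diff = spec.replicas - len(running_workers)
    if diff > 0:
        # Too few workers: create the missing ones.
        actions += [f"create {spec.name}-worker" for _ in range(diff)]
    elif diff < 0:
        # Too many workers: delete the surplus from the end of the list.
        actions += [f"delete {worker}" for worker in running_workers[diff:]]
    # diff == 0 means the system has converged and there is nothing to do.
    return actions


if __name__ == "__main__":
    spec = InferenceJobSpec(name="llm-serving", replicas=3)
    print(reconcile(spec, ["llm-serving-worker-a"]))
```

Calling `reconcile` again once the state matches returns an empty list. That idempotency is what lets a controller run the same loop on every event without causing harm.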

I get it — this can sound a bit advanced. But if you’ve already worked with Kubernetes basics like Pods, Services, ReplicaSets, and Deployments, then CRDs are your next leap.

They’re basically custom object types built on top of the same framework, and they let you define infrastructure and behavior in your own terms (which is needed now more than ever).
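
Here is roughly what registering such a type looks like. This sketch builds a CustomResourceDefinition manifest for an `InferenceJob` as a Python dict and prints it as JSON. The API group `ml.example.com` and the `model`, `replicas`, and `gpuType` fields are my own assumptions for illustration; your schema would reflect whatever your platform actually needs.

```python
import json

# Hypothetical API group, chosen only for illustration.
GROUP = "ml.example.com"

inference_job_crd = {
    "apiVersion": "apiextensions.k8s.io/v1",
    "kind": "CustomResourceDefinition",
    # The CRD name must have the form <plural>.<group>.
    "metadata": {"name": f"inferencejobs.{GROUP}"},
    "spec": {
        "group": GROUP,
        "scope": "Namespaced",
        "names": {
            "kind": "InferenceJob",
            "singular": "inferencejob",
            "plural": "inferencejobs",
        },
        "versions": [
            {
                "name": "v1alpha1",
                "served": True,   # exposed through the API server
                "storage": True,  # the version persisted in etcd
                "schema": {
                    "openAPIV3Schema": {
                        "type": "object",
                        "properties": {
                            "spec": {
                                "type": "object",
                                # Assumed fields for an inference workload.
                                "required": ["model", "replicas"],
                                "properties": {
                                    "model": {"type": "string"},
                                    "replicas": {"type": "integer", "minimum": 1},
                                    "gpuType": {"type": "string"},
                                },
                            }
                        },
                    }
                },
            }
        ],
    },
}

print(json.dumps(inference_job_crd, indent=2))
```

Since `kubectl apply -f` accepts JSON as well as YAML, the printed output could be saved to a file and applied to a test cluster. On its own the CRD only teaches the API server a new object type; a controller still has to act on those objects.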

If you can explain the tradeoffs between using a Helm chart and writing a Kubernetes Operator, you're signaling platform-level thinking — the kind that scales in modern, AI-first orgs.

Where to start if you're new:

KodeKloud Labs: free hands-on practice resources for experimenting with Kubernetes
Kubernetes Operator Resources: practice extending Kubernetes
K8s Cert Playbook

3. Cloud Observability — From Metrics to Meaning

As cloud workloads become more distributed — especially with training jobs, inference endpoints, and GPU-based compute — visibility isn't optional. It’s essential.

Modern observability isn't just about graphs — it’s about real insight.

[Image: Cloud Observability with Prometheus & Grafana. Credit: Uptrace]

Here’s what teams (and hiring panels) are looking for:

Engineers who can build full-stack dashboards that span compute, storage, and AI-specific metrics like GPU usage, model latency, and token throughput
Ability to define meaningful SLIs/SLOs that reflect user experience, not just backend metrics
Familiarity with OpenTelemetry, Prometheus, Loki, and Grafana for collecting and visualizing metrics, logs, and traces in a unified way
Systems that include proactive alerting with thresholds, anomaly detection, and minimal noise
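
To show what a "meaningful SLI/SLO" can mean in practice, here is a small sketch of the arithmetic behind an availability SLO and its error budget. The request counts and the 99.9% target are invented numbers, and real setups usually compute this from Prometheus queries over a rolling window rather than from raw counters like these.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests that met the user-facing success criteria."""
    if total_requests == 0:
        return 1.0  # no traffic, so nothing was violated
    return good_requests / total_requests


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Fraction of the error budget left (1.0 = untouched, below 0 = overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failure = 1.0 - slo_target  # failure rate the SLO tolerates
    actual_failure = 1.0 - sli          # failure rate users actually saw
    return 1.0 - actual_failure / allowed_failure


if __name__ == "__main__":
    # Invented example: 1,000,000 inference requests, 500 of them failed or too slow.
    sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
    remaining = error_budget_remaining(sli, slo_target=0.999)
    print(f"SLI: {sli:.4%}, error budget remaining: {remaining:.0%}")
```

With these numbers, about half of the budget is left. Alerting on how fast that budget is being spent, instead of on every individual spike, is one common way to keep alert noise down.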

If you can instrument and monitor a production AI pipeline with clarity — you’re speaking the same language as the people responsible for keeping multimillion-dollar workloads online.

Where to start if you're new:

Beginner Guide: Monitoring Kubernetes with Prometheus + Grafana (YouTube tutorial)

In Summary:

If you want to stay ahead in Cloud and DevOps — and not get stuck maintaining yesterday’s stack — these are the 3 skills to double down on:

Terraform — for codifying complex, scalable infrastructure
Kubernetes extensions — for building smarter, self-managing platforms
Cloud observability — to turn chaos into insight, especially in AI-first systems

Pick the one you’re least confident in, and dive into just one project this week.

These aren’t skills you get overnight — they’re capabilities that compound over time.

Next Thursday — we’ll break down a real-world platform deployment and how teams are stitching these three layers together.

— Vishakha