Top 25 SRE & DevOps Interview Questions in India (With Answer Frameworks)
Platform and SRE interviews in India typically blend Linux fundamentals, cloud and Kubernetes, CI/CD, Terraform, and behavioural reliability questions. This guide lists 25 questions we see reflected in real job descriptions and interview loops — with frameworks for strong answers, not rote scripts.
Use this alongside live listings on FzlOps to match interview prep to the stacks employers hire for. For Udemy course picks by topic, see our Best DevOps Courses guide.
Linux & systems (5 questions)
1. A service is slow. How do you troubleshoot on a Linux server?
Framework: top-down. User impact → metrics → process → network → disk.
Mention: top/htop, ss/netstat, journalctl, df -h, iostat, checking recent deploys, correlation with traffic spikes. End with how you’d communicate status during investigation.
2. Explain the difference between hard links and soft links.
Hard link: same inode, same file data; cannot cross filesystems; no dangling links. Symlink: pointer path; can cross filesystems; can break if target removed. Tie to deployment patterns (symlinks for release directories).
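If you want to demonstrate the difference rather than recite it, a short script helps. The sketch below is illustrative only: the file names (`release.txt`, `hard.txt`, `current`) are made up, and it assumes a Unix-like filesystem that supports both link types.

```python
import os
import tempfile


def link_demo() -> dict:
    """Create a file, a hard link and a symlink, then delete the original name."""
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "release.txt")  # hypothetical file names
        hard = os.path.join(tmp, "hard.txt")
        soft = os.path.join(tmp, "current")

        with open(target, "w") as fh:
            fh.write("v1")

        os.link(target, hard)     # second directory entry for the same inode
        os.symlink(target, soft)  # separate inode that only stores a path

        same_inode = os.stat(target).st_ino == os.stat(hard).st_ino
        link_count = os.stat(target).st_nlink

        os.remove(target)         # remove the original name

        with open(hard) as fh:
            hard_data = fh.read()

        return {
            "same_inode": same_inode,
            "link_count": link_count,
            # Data survives because the inode still has one name left.
            "hard_still_readable": hard_data == "v1",
            # The symlink still exists but now points at nothing.
            "symlink_dangling": os.path.islink(soft) and not os.path.exists(soft),
        }


if __name__ == "__main__":
    print(link_demo())
```

The dangling symlink at the end is the same failure mode you guard against when a `current` symlink points at a release directory that a cleanup job has deleted.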
3. What happens when you run docker run vs kubectl apply?
docker run: starts a single container on one host, with a local lifecycle. kubectl apply: submits declarative desired state to the API server; controllers then reconcile toward it (Deployment → ReplicaSet → Pods). A good answer shows you understand the orchestration model versus the single-container mental model.
4. How do you debug exit code 137 / OOMKilled?
137 = 128 + 9, meaning the process was killed by SIGKILL, often by the OOM killer. Check container limits, node memory pressure, dmesg, application memory leaks, JVM heap if applicable. Fix: adjust limits, fix the leak, scale horizontally, or tune the workload.
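The 128 + N convention is easy to show in a few lines. This is a minimal sketch, assuming a Unix-like system; `describe_exit_code` is a hypothetical helper name, not part of Docker or Kubernetes.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a shell/container exit status using the 128 + N convention."""
    if code == 0:
        return "success"
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"application error (exit {code})"


if __name__ == "__main__":
    for status in (0, 1, 137, 143):
        print(status, "->", describe_exit_code(status))
```

Note that 137 only tells you SIGKILL was delivered. Confirming an OOM kill still needs the container status reason or kernel logs.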
5. Describe file permissions chmod 755 vs 644.
755: rwx for owner, rx for group/other (typical dirs/binaries). 644: rw for owner, r for group/other (typical files). Security angle: avoid world-writable; principle of least privilege.
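To check your reading of an octal mode, you can render it the way `ls -l` does. The sketch below uses Python's standard `stat` module; the helper names `describe_mode` and `is_world_writable` are illustrative.

```python
import stat


def describe_mode(octal_mode: str, is_dir: bool = False) -> str:
    """Render an octal mode such as '755' as an ls -l style string."""
    mode = int(octal_mode, 8)
    file_type = stat.S_IFDIR if is_dir else stat.S_IFREG
    return stat.filemode(file_type | mode)


def is_world_writable(octal_mode: str) -> bool:
    """True if 'other' has the write bit set."""
    return bool(int(octal_mode, 8) & stat.S_IWOTH)


if __name__ == "__main__":
    print(describe_mode("755", is_dir=True))
    print(describe_mode("644"))
    print(is_world_writable("777"))
```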
Kubernetes (6 questions)
6. Walk through what happens when a Pod is scheduled.
API server accepts spec → scheduler assigns node → kubelet pulls images → CNI sets networking → probes run → Service endpoints update if ready. Optional: mention admission controllers.
7. Deployment vs StatefulSet — when use each?
Deployment: stateless, interchangeable pods. StatefulSet: stable network ID, ordered rollout, persistent volumes per pod (databases, Kafka brokers).
8. How do liveness and readiness probes differ?
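Liveness: the kubelet restarts the container when the probe fails (for example, a deadlock). Readiness: the pod is removed from Service endpoints while it is not ready (startup, dependency down) but is not restarted. An overly aggressive liveness probe kills healthy-but-slow pods.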
9. Explain Horizontal Pod Autoscaler (HPA).
Scales replica count based on metrics (CPU, memory, custom). Needs metrics-server or a custom metrics API. Mention its limitations: reaction lag, cold start, cluster capacity.
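It helps to be able to write down the core scaling rule from the Kubernetes documentation: desired replicas = ceil(current replicas × current metric / target metric). The sketch below is a simplification under stated assumptions: the 10% tolerance matches the documented default, the min/max bounds are example values, and the real controller also accounts for unready pods and stabilization windows, which are omitted here.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,    # example bounds, set per workload
    max_replicas: int = 10,
    tolerance: float = 0.1,   # skip scaling when within 10% of target
) -> int:
    """Simplified version of the HPA replica calculation."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 4 pods averaging 90% CPU against a 60% target.
    print(desired_replicas(4, 90, 60))
```

Walking through one example like this in an interview (4 pods at 90% CPU against a 60% target scale to 6) is usually enough to show you understand the mechanism.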
10. How would you roll back a bad deployment?
kubectl rollout undo, or Git revert + CI redeploy for GitOps. Mention monitoring error rate during rollout, maxUnavailable/maxSurge strategy.
11. What is a NetworkPolicy?
Kubernetes resource defining allowed ingress/egress between pods. Requires CNI support (Calico, Cilium). Shows zero-trust awareness inside cluster.
CI/CD & GitOps (5 questions)
12. Design a CI pipeline for a microservice.
Stages: lint → unit test → build image → scan → push registry → deploy staging → integration test → manual/auto promote prod. Mention secrets in CI, immutable tags, SBOM optional.
13. Blue-green vs canary vs rolling deployment.
Rolling: incremental replace. Blue-green: two full envs, switch traffic. Canary: small % traffic to new version, metric-gated promotion. Tradeoffs: cost, blast radius, rollback speed.
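"Metric-gated promotion" is easier to explain with a concrete rule. The following is a hypothetical sketch, not how any particular tool decides: the thresholds (`max_ratio`, `min_requests`, `floor_rate`) are invented for illustration, and real canary analysis usually looks at latency and other signals as well.

```python
def canary_decision(
    baseline_errors: int,
    baseline_requests: int,
    canary_errors: int,
    canary_requests: int,
    max_ratio: float = 1.5,     # canary may be at most 1.5x the baseline error rate
    min_requests: int = 500,    # do not judge on too little traffic
    floor_rate: float = 0.001,  # always tolerate up to 0.1% errors
) -> str:
    """Return 'wait', 'promote', or 'rollback' for a canary release."""
    if canary_requests < min_requests or baseline_requests == 0:
        return "wait"
    baseline_rate = baseline_errors / baseline_requests
    canary_rate = canary_errors / canary_requests
    allowed = max(baseline_rate * max_ratio, floor_rate)
    return "promote" if canary_rate <= allowed else "rollback"


if __name__ == "__main__":
    print(canary_decision(10, 10_000, 50, 1_000))
```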
14. Where do secrets belong in CI/CD?
Never in repo. Use vault/SSM/Sealed Secrets/external secret operator. CI gets short-lived credentials via OIDC where possible.
15. What is GitOps?
Git as source of truth; controller (Argo CD, Flux) reconciles cluster to repo. Benefits: audit trail, rollback, consistency. Challenges: secret handling, drift detection.
16. How do you speed up slow pipelines?
Parallel jobs, layer caching for Docker, test splitting, smaller images, selective deploys (monorepo tools), self-hosted runners for heavy workloads.
Terraform & cloud (5 questions)
17. What is Terraform state and why does it matter?
State maps the resources in your config to real-world infrastructure. Use remote state (S3 + locking) for teams. Losing state leaves orphaned resources or causes duplicate creation attempts.
18. Explain plan vs apply.
Plan: dry-run diff. Apply: execute changes. Always plan in CI for review; apply with approval gates in prod.
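One way to wire "plan in CI" into a pipeline is to branch on the exit status of `terraform plan -detailed-exitcode`, which returns 0 for no changes, 1 for an error, and 2 when changes are present. The wrapper below is a sketch: the function names and the `tfplan` output file name are assumptions, and it expects the `terraform` binary on the PATH and an initialized working directory.

```python
import subprocess

# Exit codes documented for `terraform plan -detailed-exitcode`.
PLAN_OUTCOMES = {0: "no-changes", 1: "error", 2: "changes-present"}


def classify_plan_exit(code: int) -> str:
    """Map a plan exit code to a pipeline decision label."""
    return PLAN_OUTCOMES.get(code, "unknown")


def run_plan(workdir: str) -> str:
    """Run a plan and save it so a later apply uses exactly what was reviewed."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-out=tfplan"],
        cwd=workdir,
    )
    return classify_plan_exit(result.returncode)
```

A pipeline could skip the approval gate on "no-changes", fail on "error", and request review only on "changes-present".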
19. How do you structure Terraform for multiple environments?
Workspaces, separate state per env, or directory per env with shared modules. Consistent module versioning; avoid copy-paste.
20. IAM least privilege for a CI deploy role.
Scope to specific cluster/service ARNs, use OIDC federation from GitHub/GitLab, no long-lived keys, separate roles per env.
21. Compare AWS EKS vs self-managed Kubernetes.
EKS: managed control plane and AWS integration; you pay a control-plane fee and still operate the worker nodes. Self-managed: more control, more operational burden. Most Indian enterprises default to managed offerings.
Observability & SRE mindset (4 questions)
22. What are the four golden signals?
Latency, traffic, errors, saturation (Google SRE). Explain how you’d instrument a web service (RED/USE methods as alternative framing).
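To make the instrumentation answer concrete, here is a minimal sketch that derives three of the four signals from a batch of request records. The `Request` shape and `summarize` function are hypothetical; saturation is left out because it comes from resource metrics (CPU, memory, queue depth), not from request logs.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    latency_ms: float
    status: int


def summarize(requests: List[Request], window_seconds: float) -> Dict[str, float]:
    """Compute traffic, error rate and p95 latency for one time window."""
    n = len(requests)
    if n == 0:
        return {"traffic_rps": 0.0, "error_rate": 0.0, "latency_p95_ms": 0.0}
    latencies = sorted(r.latency_ms for r in requests)
    rank = (95 * n + 99) // 100  # nearest-rank p95, integer ceiling of 0.95 * n
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "traffic_rps": n / window_seconds,
        "error_rate": errors / n,
        "latency_p95_ms": latencies[rank - 1],
    }


if __name__ == "__main__":
    sample = [Request(latency_ms=float(i), status=200) for i in range(1, 20)]
    sample.append(Request(latency_ms=20.0, status=500))
    print(summarize(sample, window_seconds=10))
```

In production you would export these as counters and histograms to a metrics system rather than compute them from raw records.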
23. Describe an incident you handled.
STAR format: Situation, Task, Action, Result. Include detection, comms, mitigation, root cause, preventive action item. Interviewers care about calm execution and learning.
24. Error budget — what is it?
Allowed unreliability derived from SLO. When budget burns, focus on reliability over features. Connects product and platform priorities.
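The arithmetic is worth having at your fingertips. As a sketch, assuming a time-based availability SLO over a rolling window (the function names are illustrative):

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, window_days: int, downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    print(round(error_budget_minutes(0.999, 30), 1))
    print(round(budget_remaining(0.999, 30, 21.6), 2))
```

A 99.9% SLO over 30 days allows about 43.2 minutes of downtime, so a single 21.6-minute incident spends half the budget.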
25. How do you reduce alert fatigue?
Fix flaky alerts, SLO-based paging, runbooks, alert ownership, aggregate related alerts, post-incident review of noise.
How Indian interview loops differ by company type
| Company type | Typical loop |
|---|---|
| Product startup | Take-home or live coding lite + system design + culture |
| GCC / enterprise | Structured rounds, behavioural, cloud cert preferred |
| IT services | Client stack match, sometimes faster process |
Always ask the recruiter about the number of rounds, on-call expectations, and the stack you would work on from day one.
Prep plan for the next two weeks
Days 1–3: Linux + networking refresh; practice one troubleshooting narrative aloud
Days 4–7: Kubernetes (CKA syllabus covers most interview K8s depth)
Days 8–10: One Terraform module exercise; practise explaining the remote state backend
Days 11–12: Write two STAR incident stories
Days 13–14: Mock interview; apply to 3 roles where you’d accept an offer
After the interview
Send a brief thank-you if you have a contact. Note questions you missed — those become your next study list. Interview prep compounds: the same Kubernetes and CI/CD themes recur across companies.
When you’re ready to practice against real reqs, browse DevOps and platform roles on FzlOps and align your stories to the stacks listed in each description.
Good luck — consistency beats cramming.