
K8S Lifecycle Automation Engineer (RARR Job 5420)
Job Description
Position Overview
We are seeking a Senior Kubernetes Platform Engineer to design and implement the ZeroTouch Build, Upgrade, and Certification pipeline for our on-premises GPU cloud platform. This role will focus on automating the Kubernetes layer and its dependencies (e.g., GPU drivers, networking, runtime) using 100% GitOps workflows. You will collaborate across teams to deliver a fully declarative, scalable, and reproducible infrastructure stack — from hardware to Kubernetes and platform services.
Key Responsibilities
- Architect and implement GitOps-driven Kubernetes cluster lifecycle automation using tools such as kubeadm, Cluster API, Helm, and Argo CD.
- Develop and manage declarative infrastructure components (see the sketch after this list) for:
  - GPU stack deployment (e.g., NVIDIA GPU Operator)
  - Container runtime configuration (containerd)
  - Networking layers (CNI plugins such as Calico, Cilium, etc.)
- Lead automation initiatives to enable zero-touch upgrades and certification pipelines for Kubernetes clusters and workloads.
- Maintain Git-backed sources of truth for all platform configurations and integrations.
- Standardize deployment practices for multi-cluster GPU environments, ensuring scalability, repeatability, and compliance.
- Integrate observability, testing, and validation into continuous delivery (e.g., cluster conformance tests, GPU health checks).
- Collaborate with infrastructure, security, and SRE teams to ensure smooth handoffs between the hardware/OS and Kubernetes platform layers.
- Mentor junior engineers and help shape the platform automation roadmap.
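
For illustration only, the sketch below shows the kind of Git-backed, declarative component described above: an Argo CD Application that installs the NVIDIA GPU Operator Helm chart and keeps it reconciled automatically. The chart version, namespaces, and Helm values are placeholder assumptions, not values from this platform.

```python
"""Sketch of a declarative, Git-backed platform component: an Argo CD
Application that installs the NVIDIA GPU Operator Helm chart.
Chart version, namespaces, and values below are placeholders."""
import yaml  # PyYAML

gpu_operator_app = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Application",
    "metadata": {"name": "gpu-operator", "namespace": "argocd"},
    "spec": {
        "project": "default",
        "source": {
            # Upstream NVIDIA Helm repository; the chart version is pinned in Git.
            "repoURL": "https://helm.ngc.nvidia.com/nvidia",
            "chart": "gpu-operator",
            "targetRevision": "v25.3.0",  # placeholder version
            "helm": {"values": "driver:\n  enabled: true\n"},
        },
        "destination": {
            "server": "https://kubernetes.default.svc",
            "namespace": "gpu-operator",
        },
        "syncPolicy": {
            # Auto-sync with pruning and self-healing: no manual kubectl applies.
            "automated": {"prune": True, "selfHeal": True},
            "syncOptions": ["CreateNamespace=true"],
        },
    },
}

# The rendered YAML is what would be committed to the Git source of truth.
print(yaml.safe_dump(gpu_operator_app, sort_keys=False))
```

In a ZeroTouch pipeline, the rendered manifest lives in Git and Argo CD reconciles it continuously; upgrades and rollbacks happen by changing the pinned revision, not by applying changes to clusters by hand.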
Required Skills & Experience
- 10+ years of hands-on infrastructure engineering experience with a strong Kubernetes focus.
- Core expertise in the Kubernetes API, Helm templating, Argo CD, GitOps integration, Go/Python scripting, and containerd.
- Deep knowledge of:
  - Kubernetes cluster management (kubeadm, Cluster API)
  - Argo CD for GitOps-based delivery
  - Helm for application and cluster add-on packaging
  - containerd as the container runtime for GPU workloads
- Experience deploying and managing the NVIDIA GPU Operator (or equivalent) in production.
- Strong understanding of the CNI plugin ecosystem, network policies, and multi-tenant networking.
- Proven track record with Infrastructure as Code using Git-based workflows.
- Experience building Kubernetes clusters in on-premises environments (as opposed to managed cloud services).
- Solid scripting/automation skills in Bash, Python, and Go (see the health-check sketch after this list).
- Familiarity with Linux internals, systemd, and OS-level tuning for container workloads.
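
As a rough illustration of the scripting expected in this role, the sketch below checks GPU node health through the Kubernetes API, the kind of validation a certification pipeline might run after an upgrade. The node label, kubeconfig source, and script structure are assumptions, not part of this posting.

```python
"""Minimal GPU node health check, e.g. run as a post-upgrade validation step.

Assumptions (not from the posting): GPU nodes carry the label
nvidia.com/gpu.present=true applied by the GPU Operator's feature discovery,
and kubeconfig credentials are available to the script."""
from kubernetes import client, config


def gpu_nodes_healthy(label_selector: str = "nvidia.com/gpu.present=true") -> bool:
    config.load_kube_config()  # or config.load_incluster_config() inside a pod
    v1 = client.CoreV1Api()
    healthy = True
    for node in v1.list_node(label_selector=label_selector).items:
        name = node.metadata.name
        # The device plugin advertises GPUs as allocatable node resources.
        gpus = int((node.status.allocatable or {}).get("nvidia.com/gpu", "0"))
        ready = any(c.type == "Ready" and c.status == "True"
                    for c in node.status.conditions or [])
        if gpus == 0 or not ready:
            print(f"UNHEALTHY: {name} ready={ready} allocatable GPUs={gpus}")
            healthy = False
        else:
            print(f"OK: {name} exposes {gpus} allocatable GPUs")
    return healthy


if __name__ == "__main__":
    raise SystemExit(0 if gpu_nodes_healthy() else 1)
```

A check like this could gate promotion between cluster environments, returning a non-zero exit code whenever a GPU node is not Ready or advertises no allocatable GPUs.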
Preferred / Bonus Skills
- Experience developing custom controllers/operators or Kubernetes API extensions.
- Contributions to Kubernetes or CNCF projects.
- Exposure to service meshes, ingress controllers, or workload identity providers.