August 16, 2025
Senior • Hybrid • On-site • Remote
$184,000 - $287,500/yr
Santa Clara, CA
Joining NVIDIA's DGX Cloud Lepton Team means contributing to the infrastructure that powers our innovative AI research. This team focuses on optimizing the efficiency and resiliency of AI workloads, as well as developing scalable AI and data infrastructure tools and services. Our objective is to deliver a stable, scalable environment for AI researchers, providing them with the resources and scale they need to innovate. DGX Cloud Lepton delivers NVIDIA-managed GPU/Kubernetes capacity for AI workloads.
As a Senior System Engineer, you’ll own the Lepton platform’s reliability and ensure security is a first-class part of day-to-day operations. You’ll have the autonomy to drive meaningful projects with strong mentorship and support. We practice blameless postmortems, iterate continuously, and encourage thoughtful risk-taking. If you’re looking for an impactful, rewarding role, we invite you to apply.
What you’ll be doing:
Platform fundamentals: design, build, and operate core services and node/cluster foundations for the Lepton platform; automate deployments, upgrades, and day-2 operations.
Vulnerability & patch management: own intake, prioritization, rollout, and rollback rhythms across OS, drivers/firmware, and platform components for the Lepton product.
Security as a product quality: define, deliver, and maintain secure-by-default baselines (host hardening, workload isolation, network segmentation, least-privilege access) for AI infrastructure at scale.
Identity & access stewardship: standardize patterns for service identity, role scoping, secrets handling, and certificate hygiene.
Trusted releases: drive change control and release practices that ensure traceability and integrity of what runs in production.
Monitoring & incident practice: establish health signals and SLOs; lead investigations, root-cause analysis, and follow-through actions that improve both reliability and security.
Risk & readiness: partner with product, SRE, and security stakeholders to assess risks for new features and close gaps with pragmatic controls.
Documentation & mentorship: publish runbooks and standards; review designs and coach engineers on secure operational practices.
What we need to see:
7+ years in systems/platform engineering operating large-scale production environments.
Demonstrated ability to deliver secure, reliable platforms (hardening, access control, isolation, monitoring, and strong operational runbooks).
Experience with containerized/managed cluster environments; familiarity with GPU-accelerated platforms or the ability to ramp quickly.
Automation mindset with infrastructure-as-code and CI/CD; disciplined change management.
Clear communication and documentation skills; ability to turn requirements into practical, supportable designs.
Bachelor's degree or higher in Computer Science or a related technical field (or equivalent experience).
Ways to stand out from the crowd:
Hands-on engineering experience delivering and driving platform security baselines in multi-tenant environments.
Production Kubernetes experience (EKS/AKS/GKE) at a fundamental level, especially private clusters and Pod Security Admission (PSA) restricted defaults.
Supply-chain basics at scale: signed images (cosign) enforced via policy-as-code (Kyverno/OPA).
Familiarity with NVIDIA GPU platforms (GPU Operator/device plugin, MIG-aware operations).
You will also be eligible for equity and benefits.