August 16, 2025

Senior System Engineer – DGX Cloud Lepton

Senior • Hybrid • On-site • Remote

$184,000 - $287,500/yr

Santa Clara, CA

Joining NVIDIA's DGX Cloud Lepton team means contributing to the infrastructure that powers our innovative AI research. This team focuses on optimizing the efficiency and resiliency of AI workloads, as well as developing scalable AI and data infrastructure tools and services. Our objective is to deliver a stable, scalable environment for AI researchers, providing them with the necessary resources and scale to foster innovation. DGX Cloud Lepton delivers NVIDIA-managed GPU/Kubernetes capacity for AI workloads.

As a Senior System Engineer, you’ll own the Lepton platform’s reliability and ensure security is a first-class part of day-to-day operations. You’ll have the autonomy to drive meaningful projects with strong mentorship and support. We practice blameless postmortems, iterate continuously, and encourage thoughtful risk-taking. If you’re looking for an impactful, rewarding role, we invite you to apply.

What you’ll be doing:

Platform fundamentals: design, build, and operate core services and node/cluster foundations for the Lepton platform; automate deployments, upgrades, and day-2 operations.
Vulnerability & patch management: own intake, prioritization, rollout, and rollback rhythms across OS, drivers/firmware, and platform components for the Lepton product.
Security as a product quality: define, deliver, and maintain secure-by-default baselines (host hardening, workload isolation, network segmentation, least-privilege access) for AI infrastructure at scale.
Identity & access stewardship: standardize patterns for service identity, role scoping, secrets handling, and certificate hygiene.
Trusted releases: drive change control and release practices that ensure traceability and integrity of what runs in production.
Monitoring & incident practice: establish health signals and SLOs; lead investigations, root causes, and follow-through actions that improve both reliability and security.
Risk & readiness: partner with product, SRE, and security stakeholders to assess risks for new features and close gaps with pragmatic controls.
Documentation & mentorship: publish runbooks and standards; review designs and coach engineers on secure operational practices.

What we need to see:

7+ years in systems/platform engineering operating large-scale production environments.
Demonstrated ability to deliver secure, reliable platforms (hardening, access control, isolation, monitoring, and strong operational runbooks).
Experience with containerized/managed cluster environments; familiarity with GPU-accelerated platforms or the ability to ramp quickly.
Automation mindset with infrastructure-as-code and CI/CD; disciplined change management.
Clear communication and documentation skills; ability to turn requirements into practical, supportable designs.
Bachelor's degree or higher in Computer Science or a related technical field (or equivalent experience).

Ways to stand out from the crowd:

Hands-on engineering experience delivering and driving platform security baselines in multi-tenant environments.
Production Kubernetes experience (EKS/AKS/GKE) at a fundamental level, especially private clusters and PSA restricted defaults.
Supply-chain basics at scale: signed images (cosign) enforced via policy-as-code (Kyverno/OPA).
Familiarity with NVIDIA GPU platforms (GPU Operator/device plugin, MIG-aware operations).

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 184,000 USD - 287,500 USD for Level 4, and 224,000 USD - 356,500 USD for Level 5.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until August 19, 2025.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

NVIDIA

Founded in 1993 by Jen-Hsun Huang, Chris Malachowsky, and Curtis Priem, NVIDIA Corporation has carved out a leading position in the technology industry. Based in Santa Clara, California, NVIDIA is renowned for its GeForce series of GPUs, which cater to both gaming and professional applications. The company's graphics processing units are integral to various sectors, from gaming to machine learning and data centers. As a frontrunner in the semiconductor industry, NVIDIA continues to leverage emerging technologies like AI to stay ahead of the curve.