December 19, 2025
Senior • On-site
$168,000 - $258,750/yr
Santa Clara, CA, +2
NVIDIA DGX Cloud is searching for a highly technical Product Manager to guide Health Automation and Resilience efforts for AI infrastructure. This role is responsible for developing products for fault detection, failure classification, automated repair workflows, and resilience tooling that enables consistent GPU fleet performance. You will build the next generation of health automation capabilities including detection pipelines, classification mechanisms, repair automation, and distributed resilience methods.
The position lies at the crossroads of distributed systems, observability, GPU hardware, and cloud operations. You will collaborate with engineering teams to transform signals, telemetry, and operational lessons into automation infrastructure that improves cloud provider efficiency and end-user experience at scale. If you are motivated by building foundational systems that enable large AI clusters to operate dependably and efficiently, we would love to hear from you.
What You Will Be Doing:
Establish the product vision and strategy for Health Automation and Resilience across DGX Cloud and partner GPU fleets.
Partner with engineering on the architecture and delivery of software agents, services, control loops, and distributed health components.
Convert hardware signals, telemetry pipelines, and operational insights into automation systems that reduce manual intervention.
Work with cloud providers and enterprise operators to understand failure modes and operational challenges.
Develop product specifications, technical requirements, and validation criteria for both internal and open-source components.
Support go-to-market activities including documentation, demos, partner enablement, and release readiness.
Track trends in observability, SRE practices, distributed systems, and automated operations to define long-term strategy.
Lead product technical reviews, customer conversations, and planning sessions.
What We Need to See:
Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent experience.
8+ years of relevant experience, including a demonstrated record of leading technical products in cloud infrastructure, distributed systems, reliability engineering, or related fields.
Track record of defining multi-quarter strategy and leading execution across multiple engineering teams.
Ability to craft clear product requirements, work directly with engineering partners on technical decisions, and compose system-level workflows.
Strong architectural understanding of control planes, telemetry systems, health monitoring, repair workflows, or automated remediation systems.
Understanding of telemetry signals, SLOs, failure modes, and repair workflows in production environments.
Experience building automation, resilience, or failure-recovery capabilities for large-scale cloud or HPC environments.
Experience working with open-source technologies or products for software developers.
Excellent communication skills across engineering, customers, and executives.
Ways to Stand Out from the Crowd:
Experience with GPU-accelerated compute, HPC systems, or large-scale AI clusters.
Knowledge of Kubernetes operators, node health workflows, autoscaling, or control-plane automation.
Experience with modern observability and diagnostics technologies such as Prometheus, OpenTelemetry, eBPF, or distributed tracing.
Contributions to infrastructure or reliability open-source communities.
Experience writing detailed build documents for software agents, distributed services, or platform-level components.
NVIDIA is widely considered to be one of the technology world’s most desirable employers! We have some of the most forward-thinking and hardworking people in the world working for us. If you're creative and autonomous, we want to hear from you!
You will also be eligible for equity and benefits.