December 19, 2025

Product Manager, Health Automation and Resilience

Senior • On-site

$168,000 - $258,750/yr

Santa Clara, CA, +2

NVIDIA DGX Cloud is searching for a highly technical Product Manager to guide Health Automation and Resilience efforts for AI infrastructure. This role is responsible for developing products for fault detection, failure classification, automated repair workflows, and resilience tooling that enables consistent GPU fleet performance. You will build the next generation of health automation capabilities including detection pipelines, classification mechanisms, repair automation, and distributed resilience methods.

The position lies at the crossroads of distributed systems, observability, GPU hardware, and cloud operations. You will collaborate with engineering teams to transform signals, telemetry, and operational lessons into automation infrastructure that improves cloud provider efficiency and end-user experience at scale. If you are motivated by building foundational systems that enable large AI clusters to operate dependably and efficiently, we would love to hear from you.

What You Will Be Doing:

  • Establish the product vision and strategy for Health Automation and Resilience across DGX Cloud and partner GPU fleets.

  • Partner with engineering on the architecture and delivery of software agents, services, control loops, and distributed health components.

  • Convert hardware signals, telemetry pipelines, and operational insights into automation systems that reduce manual intervention.

  • Work with cloud providers and enterprise operators to understand failure modes and operational challenges.

  • Develop product specifications, technical requirements, and validation criteria for both internal and open-source components.

  • Support go-to-market activities including documentation, demos, partner enablement, and release readiness.

  • Track trends in observability, SRE practices, distributed systems, and automated operations to define long-term strategy.

  • Lead product technical reviews, customer conversations, and planning sessions.

What We Need to See:

  • Bachelor’s degree in Computer Science, Engineering, or a similar area, or equivalent experience.

  • 8+ years of relevant experience, including a demonstrated record of leading technical products in cloud infrastructure, distributed systems, reliability engineering, or related fields.

  • Track record of defining multi-quarter strategy and leading execution across multiple engineering teams.

  • Ability to craft clear product requirements, work directly with engineering partners on technical decisions, and compose system-level workflows.

  • Strong architectural understanding of control planes, telemetry systems, health monitoring, repair workflows, or automated remediation systems.

  • Understanding of telemetry signals, SLOs, failure modes, and repair workflows in production environments.

  • Experience building automation, resilience, or failure-recovery capabilities for large-scale cloud or HPC environments.

  • Experience working with open-source technologies or products for software developers.

  • Excellent communication skills with engineering teams, customers, and executives.

Ways to Stand Out from the Crowd:

  • Experience with GPU-accelerated compute, HPC systems, or large-scale AI clusters.

  • Knowledge of Kubernetes operators, node health workflows, autoscaling, or control-plane automation.

  • Experience with modern observability and diagnostics technologies such as Prometheus, OpenTelemetry, eBPF, or distributed tracing.

  • Contributions to infrastructure or reliability open-source communities.

  • Experience writing detailed build documents for software agents, distributed services, or platform-level components.

NVIDIA is widely considered to be one of the technology world’s most desirable employers! We have some of the most forward-thinking and hardworking people in the world working for us. If you're creative and autonomous, we want to hear from you!

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 168,000 USD - 258,750 USD for Level 4, and 208,000 USD - 327,750 USD for Level 5.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until December 22, 2025.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.