September 8, 2025
Senior • Hybrid • On-site • Remote
$184,000 - $287,500/yr
Santa Clara, CA
NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once science fiction inventions, from artificial intelligence to autonomous cars. NVIDIA is looking for great people like you to help us accelerate the next wave of artificial intelligence.
The team delivers NVIDIA Mission Control Software that runs on superpods. The software we develop is shipped as an autonomous hardware recovery engine and is responsible for baseline validation tests, taking remedial actions (break/fix workflows), and periodic health checks for hardware components. We are looking for a Senior Software Engineer with experience in building highly scalable and robust enterprise software to join us. We are building and improving a powerful platform that will automate the diagnosis and repair of a cluster of GPUs or CPUs across public clouds, private clouds, and virtual and physical hardware.
What you'll be doing:
Designing and implementing scalable, reliable software components that enable the core platform to maintain an inventory of resources (including hosts, GPUs, and switches), automate actions that diagnose failures, and repair them
Enabling Agentic AI within the core platform to create remedial workflows
Influencing the product roadmap in collaboration with teams across various departments with the goal of reducing SRE toil and improving hardware utilization
Collaborating with various organizations across NVIDIA to drive adoption of the platform in order to improve GPU utilization
Defining and running benchmarks for various subsystems
Leading and delivering high-impact projects with high quality, performance, and stability while keeping resource consumption to a minimum
Developing a robust feedback control system that analyzes signals about system health and automatically runs commands to fix discovered issues
Programming in modern languages like Go and Rust
What we need to see:
Bachelor's or Master's degree in Computer Science, Engineering, or a related field (or equivalent experience)
Keen interest in driving Agentic AI projects
10 years of equivalent experience
Demonstrated ability in building scalable and robust distributed systems
Proven record of product rollouts and collaborating with early adopters
Proficiency in programming in C/C++, Java, Rust, or Go
Technical stewardship of projects across the organization
Ways to stand out from the crowd:
Deep understanding of multi-threading and distributed systems concepts
Excellent track record of delivering projects
Expertise in optimizing SQL queries
Expert-level knowledge of Go/Rust programming
With competitive salaries and a generous benefits package, NVIDIA is widely considered to be one of the technology industry's most desirable employers. We have some of the most forward-thinking and versatile people in the world working with us, and our engineering teams are growing fast in some of the most impactful fields of our generation: Cloud Engineering and Cloud Functions. If you're a creative engineer who enjoys autonomy and shares our passion for technology, we want to hear from you.
Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 184,000 USD - 287,500 USD for Level 4, and 224,000 USD - 356,500 USD for Level 5. You will also be eligible for equity and benefits.