July 31, 2025

Senior Site Reliability Engineer, NIM Factory

Senior • Hybrid • On-site • Remote

$184,000 - $287,500/yr

Santa Clara, CA

NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. It’s a unique legacy of innovation that’s fueled by great technology—and amazing people. Today, we’re tapping into the unlimited potential of AI to define the next era of computing. An era in which our GPU acts as the brains of computers, robots, and self-driving cars that can understand the world. Doing what’s never been done before takes vision, innovation, and the world’s best talent. As an NVIDIAN, you’ll be immersed in a diverse, supportive environment where everyone is inspired to do their best work. Come join the team and see how you can make a lasting impact on the world.

Join NVIDIA, where we are redefining the future of AI and computing! As a Senior Site Reliability Engineer for our NIM Factory, you will have the opportunity to operate and improve the automation of NVIDIA Inference Microservices (NIMs). This role is perfect for someone who is technically driven and creative, ready to change the way high-performance inferencing is delivered for AI models. You will be instrumental in ensuring the flawless performance, accuracy, and availability of our services, making a significant impact on AI-powered applications.

What you'll be doing:

  • Operate a software factory that transforms AI models into deployable services validated across Cloud, On-prem, and Kubernetes environments.

  • Collaborate with the development team to deliver rapid iterations on technical strategies and roadmaps, continuously evolving the NIM factory.

  • Ensure the factory's operation, availability, critical metrics, observability, and stability while tracking service deployment across multiple cloud hosts.

  • Partner with internal and external SRE teams to provide the best experience for developers and users, securing infrastructure with robust configurations and management.

  • Collaborate broadly with AI model teams to build an efficient infrastructure, driving improvements based on user feedback and mentoring team members.

  • Participate in the on-call rotation to maintain the reliability of NVIDIA NIMs and the NIM Factory.

What we need to see:

  • Advanced system engineering skills in operating and improving the observability and maintainability of distributed microservices cloud applications.

  • Proven experience in working with multi-functional teams, principals, architects, and across organizational boundaries.

  • Demonstrated ability to mentor teams, grow team members, and adapt to the needs of customers.

  • Experience in operating distributed containerized applications using Docker, K8s, Cloud Endpoints, Helm, and Prometheus, and using Infrastructure as Code tools like Terraform, Puppet, or Ansible.

  • Skill in pinpointing issues in cloud systems and an understanding of security for public cloud services.

  • BS or MS in Computer Science, Computer Engineering, or equivalent experience.

  • 8+ years of experience as an SRE or Developer working on high-performance microservices and cloud software.

Ways to stand out from the crowd:

  • Excellent communication and interpersonal skills for engaging a multi-functional team.

  • Experience with event-driven applications using services such as Temporal, Airflow, Kafka, or Redis.

  • Background in building and deploying containers for Microservices, Cloud, and On-prem deployments, along with their associated CI/CD pipelines.

  • A history of working with high-cardinality, high-dimensional metrics.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 184,000 USD - 287,500 USD for Level 4, and 224,000 USD - 356,500 USD for Level 5.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until August 3, 2025.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.