September 24, 2025
Senior • Hybrid • On-site • Remote
$184,000 - $287,500/yr
Santa Clara, CA, +1
As a Senior Machine Learning Engineer at NVIDIA, you will build the machine learning brain that keeps NVIDIA’s global DGX Cloud healthy, efficient and ready for the next waves of AI breakthroughs. DGX Cloud fuses NVIDIA GPUs, NVLink networking and the full AI software stack into elastic infrastructure powering large language models, drug discovery, autonomous driving and climate science. Your models will turn billions of telemetry signals into predictive insight. This frees customers to innovate while our platform runs smarter.
What you'll be doing:
Develop groundbreaking machine learning algorithms and models that propel our AI products.
Build production models for anomaly detection, predictive maintenance and usage optimization.
Develop tools that surface real-time telemetry, efficiency metrics and long-term trends.
Develop forecasting and simulation models for global-scale planning.
Analyze complex datasets to determine the best approach for model training and optimization.
Translate findings into clear engineering actions with infrastructure, operations and product teams.
Participate in cross-functional projects to integrate machine learning capabilities into various NVIDIA products.
What we need to see:
Master's degree or PhD in Mathematics, Statistics, Machine Learning or a related quantitative field (or equivalent experience).
8+ years of experience applying Machine Learning to operational systems.
Proven track record of building and deploying Machine Learning models in production environments.
Experience with time series analysis and optimization algorithms.
Familiarity with distributed systems and cloud platforms such as AWS and Kubernetes.
Strong software engineering skills and proficiency in Python.
Effective verbal and written communication and technical presentation skills.
Experience with machine learning frameworks such as TensorFlow, PyTorch, or similar.
A track record of delivering high-impact projects in a fast-paced environment.
Ways to stand out from the crowd:
Experience solving capacity planning problems.
Deep understanding of GPU performance metrics.
Familiarity with Prometheus and PromQL.
You will also be eligible for equity and benefits.