December 17, 2025

Principal Site Reliability Engineer, Cloud AI

Senior • On-site

$294,000-$414,000/yr

Sunnyvale, CA

Minimum qualifications:

  • Bachelor’s degree in Computer Science, related field, or equivalent practical experience.
  • 15 years of experience in software engineering.
  • 10 years of experience working on reliability, scalability, and security of large-scale distributed systems.
  • Experience with ML systems, infrastructure, or a related AI/ML field.

Preferred qualifications:

  • PhD in Electrical Engineering, Computer Science, or related field.
  • Experience in reliability/performance engineering at a hyperscaler or a company known for managing large-scale datasets and large teams of data scientists.
  • Knowledge of enterprise security principles.
  • Deep expertise in managing large-scale resource pools, such as GPU/TPU clusters.
  • A track record of success working on products across a broad portfolio, including platforms like agent-building tools or AI search.

About the job

We are seeking a Principal Engineer for Cloud AI Reliability, Resiliency, and Scalability. This technical individual contributor sits within the Cloud AI SRE team and will be a key voice in the architectural design and operational strategy of Google's Cloud AI portfolio. One of the many critical components of that portfolio is the Vertex AI platform, which runs both first-party models like Gemini and third-party models, so a focus on breadth over depth is essential. In this role, you will be responsible for ensuring the availability and scalability of our most impactful AI products, operating at planetary scale.

A secondary part of the role will involve managing enterprise risk, with a key focus on security for the products and platforms we support. You will serve as a counterpart to senior leaders and domain experts, advising on the architectural and security considerations required to launch our next generation of AI and AI agent platforms.

The ML, Systems, & Cloud AI (MSCA) organization at Google designs, implements, and manages the hardware, software, machine learning, and systems infrastructure for all Google services (Search, YouTube, etc.) and Google Cloud. Our end users are Googlers, Cloud customers and the billions of people who use Google services around the world.

We prioritize security, efficiency, and reliability across everything we do - from developing our latest TPUs to running a global network - as we shape the future of hyperscale computing. Our global impact spans software and hardware, including Google Cloud’s Vertex AI, the leading AI platform for bringing Gemini models to enterprise customers.

The US base salary range for this full-time position is $294,000-$414,000 + bonus + equity + benefits. Our salary ranges are determined by role, level, and location. Within the range, individual pay is determined by work location and additional factors, including job-related skills, experience, and relevant education or training. Your recruiter can share more about the specific salary range for your preferred location during the hiring process.

Please note that the compensation details listed in US role postings reflect the base salary only, and do not include bonus, equity, or benefits. Learn more about benefits at Google.

Responsibilities

  • Provide expert-level guidance on the architectural design of highly available, scalable, and secure AI and ML systems.
  • Advise on and manage the overall enterprise risk for our AI products and platforms, with a significant focus on identifying and mitigating security vulnerabilities.
  • Partner with engineering and product teams to architect, launch, and operate the next generation of Google's AI and AI agent platforms, built from the ground up for the future of AI.
  • Represent the SRE perspective in highly technical discussions with other senior leaders and domain experts, focusing on the infrastructure and underlying systems that power our models.
  • Influence the platform's long-term strategy, ensuring it can support a wide range of first- and third-party models for all GCP and AI Studio enterprise customers.