June 8, 2026

Lead DevOps Engineer

Senior • Remote

Łódź, Poland

About the role

We are looking for a Lead DevOps Engineer to provide technical leadership for DevOps and Site Reliability Engineering (SRE) practices supporting large-scale GPU infrastructure used for AI training and inference workloads.

This role combines hands-on engineering with team leadership. You will be responsible for shaping automation standards, improving platform reliability, and leading a team working on software-defined infrastructure, high-performance networking, observability, and operational excellence across complex production environments.

Responsibilities

  • Lead, mentor, and support a team of DevOps and SRE engineers working across the full lifecycle of GPU infrastructure platforms

  • Design and implement Infrastructure as Code solutions for provisioning and managing bare-metal GPU servers, networking, storage, and cluster orchestration components

  • Build and improve CI/CD pipelines for infrastructure, platform services, and internal tooling

  • Develop and maintain monitoring, logging, alerting, and observability solutions for large-scale GPU environments

  • Define and track SLIs/SLOs, improve incident response processes, and contribute to post-incident reviews and long-term reliability improvements

  • Work closely with Infrastructure, Networking, Facilities, and AI/ML teams to ensure stable and scalable platform operations

  • Automate operational processes such as cluster scaling, firmware and BIOS updates, hardware diagnostics, and capacity planning

  • Support DevSecOps practices, including infrastructure hardening, vulnerability management, and compliance automation

  • Identify operational inefficiencies and reduce repetitive manual work through automation

  • Evaluate and introduce new tools and solutions related to GPU infrastructure, orchestration, and cloud-native operations

Requirements

  • 8+ years of experience in DevOps, SRE, Platform Engineering, or a similar area

  • At least 3 years of experience in a technical lead, lead engineer, or team leadership role

  • Strong practical experience with infrastructure automation in large-scale or complex production environments

  • Very good knowledge of Terraform, Ansible, Pulumi, Crossplane, or similar Infrastructure as Code tools

  • Experience with GitOps, configuration management, and CI/CD practices

  • Hands-on experience with Kubernetes

  • Experience working with GPU-related technologies such as NVIDIA GPU Operator, device plugins, MIG, or time-slicing

  • Good scripting or programming skills in Python, Go, or Bash

  • Experience with bare-metal provisioning, infrastructure automation, or data center environments

  • Good knowledge of observability tools such as Prometheus, Grafana, Loki, and OpenTelemetry

  • Good understanding of distributed systems reliability and production incident management

  • Experience with high-performance networking technologies such as RDMA, InfiniBand, or RoCE will be a strong advantage

  • Ability to lead technical discussions, support team development, and communicate effectively with both technical and business stakeholders

  • English at a communicative level or higher, as you will be working in an international team

Nice to have

  • Experience in AI infrastructure, HPC environments, hyperscale infrastructure, or data center operations

  • Familiarity with orchestration and scheduling tools such as Slurm, Ray, Run:ai, KServe, or Kubernetes-based schedulers

  • Experience integrating telemetry from power, cooling, or environmental systems

  • Experience building internal platforms or self-service tools for engineering or research teams

  • Understanding of security, compliance, and audit requirements in regulated or security-sensitive environments

What we offer

  • Benefits package

  • Opportunity to shape the DevOps and SRE foundation for advanced GPU infrastructure supporting AI workloads

  • Real impact on the scalability, reliability, and operational standards of next-generation compute environments

  • Collaboration with experienced engineers across infrastructure, platform, and AI domains

  • A dynamic environment with room for ownership, technical leadership, and professional growth