June 8, 2026

Lead Linux System Administrator

Senior • Remote

Łódź, Poland

About the role

We are looking for a Lead Linux System Administrator to take technical ownership of the Linux environment supporting large-scale GPU infrastructure used for AI training and inference workloads.

This role combines hands-on system administration with team leadership. You will be responsible for the stability, performance, security, and day-to-day management of Linux-based GPU servers, while also supporting and mentoring a team of administrators working in a complex production environment.

Responsibilities

  • Lead, mentor, and support a team of Linux System Administrators responsible for GPU infrastructure operations

  • Manage the full Linux server lifecycle, including provisioning, patching, configuration management, hardening, and performance tuning

  • Maintain and optimize the NVIDIA GPU software stack, including drivers, CUDA, cuDNN, NCCL, and GPU management tools such as DCGM and nvidia-smi

  • Support and manage MIG and GPU time-slicing configurations where needed

  • Develop and maintain automation for bare-metal provisioning, OS image management, and server configuration using tools such as Ansible and Terraform, along with custom scripting

  • Tune Linux systems for demanding workloads, including kernel parameters, local storage, parallel file systems, networking, and scheduler settings

  • Troubleshoot complex issues across hardware, drivers, the operating system, and cluster-level services

  • Work closely with DevOps/SRE, Site Operations, and AI/ML teams to ensure smooth integration between OS-level infrastructure and higher-level orchestration platforms

  • Support security hardening, vulnerability management, patch compliance, and operational standards across the server fleet

  • Participate in on-call support and contribute to continuous improvements in reliability, performance, and operational efficiency

Requirements

  • 7+ years of hands-on experience in Linux system administration in production environments

  • At least 3 years of experience in a technical lead, lead administrator, or people leadership role

  • Strong expertise in administering Linux systems at scale

  • Hands-on experience with NVIDIA GPUs in Linux environments, including drivers, CUDA ecosystem components, and GPU management tools

  • Strong experience with Ansible or other configuration management tools

  • Good scripting skills in Python and/or Bash

  • Experience with Infrastructure as Code and infrastructure automation

  • Good understanding of high-performance computing, storage systems, and high-speed networking technologies such as InfiniBand or RoCE

  • Experience supporting AI/ML or HPC workloads

  • Ability to troubleshoot complex production issues and work effectively in a high-availability environment

  • English proficiency at a communicative level or higher, as you will be working in an international team

Nice to have

  • Experience with cluster management and orchestration tools such as Slurm, Kubernetes, or Run:ai

  • Familiarity with bare-metal provisioning tools and large server fleet management

  • Experience in AI infrastructure companies, hyperscalers, or HPC/research environments

  • Knowledge of Linux performance tuning for GPU-accelerated workloads

  • A university degree in Computer Science, Engineering, or a related field

What we offer

  • Benefits package

  • Opportunity to lead Linux infrastructure supporting advanced AI workloads at scale

  • Work with modern GPU hardware and software stacks in a technically demanding environment

  • Collaboration with experienced engineers across infrastructure, platform, and AI teams

  • A dynamic workplace with room for ownership, technical influence, and professional growth
