June 8, 2026

Lead DevOps Engineer

Senior • Remote

Łódź, Poland

About the role

We are looking for a Lead DevOps Engineer to provide technical leadership for DevOps and Site Reliability Engineering practices supporting large-scale GPU infrastructure used for AI training and inference workloads.

This role combines hands-on engineering with team leadership. You will be responsible for shaping automation standards, improving platform reliability, and leading a team working on software-defined infrastructure, high-performance networking, observability, and operational excellence across complex production environments.

Responsibilities

  • Lead, mentor, and support a team of DevOps and SRE engineers working across the full lifecycle of GPU infrastructure platforms

  • Design and implement Infrastructure as Code solutions for provisioning and managing bare-metal GPU servers, networking, storage, and cluster orchestration components

  • Build and improve CI/CD pipelines for infrastructure, platform services, and internal tooling

  • Develop and maintain monitoring, logging, alerting, and observability solutions for large-scale GPU environments

  • Define and track SLIs/SLOs, improve incident response processes, and contribute to post-incident reviews and long-term reliability improvements

  • Work closely with Infrastructure, Networking, Facilities, and AI/ML teams to ensure stable and scalable platform operations

  • Automate operational processes such as cluster scaling, firmware and BIOS updates, hardware diagnostics, and capacity planning

  • Support DevSecOps practices, including infrastructure hardening, vulnerability management, and compliance automation

  • Identify operational inefficiencies and reduce repetitive manual work through automation

  • Evaluate and introduce new tools and solutions related to GPU infrastructure, orchestration, and cloud-native operations

Requirements

  • 8+ years of experience in DevOps, SRE, Platform Engineering, or a similar area

  • At least 3 years of experience in a technical lead, lead engineer, or team leadership role

  • Strong practical experience with infrastructure automation in large-scale or complex production environments

  • Strong knowledge of Infrastructure as Code tools such as Terraform, Ansible, Pulumi, or Crossplane

  • Experience with GitOps, configuration management, and CI/CD practices

  • Hands-on experience with Kubernetes

  • Experience working with GPU-related technologies such as NVIDIA GPU Operator, device plugins, MIG, or time-slicing

  • Good scripting or programming skills in Python, Go, or Bash

  • Experience with bare-metal provisioning, infrastructure automation, or data center environments

  • Good knowledge of observability tools such as Prometheus, Grafana, Loki, and OpenTelemetry

  • Good understanding of distributed systems reliability and production incident management

  • Experience with high-performance networking technologies such as RDMA, InfiniBand, or RoCE will be a strong advantage

  • Ability to lead technical discussions, support team development, and communicate effectively with both technical and business stakeholders

  • English at a communicative level or higher, as you will be working in an international team

Nice to have

  • Experience in AI infrastructure, HPC environments, hyperscale infrastructure, or data center operations

  • Familiarity with orchestration and scheduling tools such as Slurm, Ray, Run:ai, KServe, or Kubernetes-based schedulers

  • Experience integrating telemetry from power, cooling, or environmental systems

  • Experience building internal platforms or self-service tools for engineering or research teams

  • Understanding of security, compliance, and audit requirements in regulated or security-sensitive environments

What we offer

  • Benefits package

  • Opportunity to shape the DevOps and SRE foundation for advanced GPU infrastructure supporting AI workloads

  • Real impact on the scalability, reliability, and operational standards of next-generation compute environments

  • Collaboration with experienced engineers across infrastructure, platform, and AI domains

  • A dynamic environment with space for ownership, technical leadership, and professional growth

Similar jobs you might like

Technology

ALTER GPU CENTER

DevOps Engineer

Mid

Remote

Łódź, Poland

🏢 Summary: Hands-on DevOps Engineer role focused on building and operating automation, deployment, and reliability standards for large-scale GPU infrastructure supporting AI training and inference. The position involves Infrastructure as Code, CI/CD, observability, security, and low-level automation across bare-metal servers, networking, storage, and Kubernetes-based platforms. The role emphasizes reliability, scalability, and automation in complex, high-performance environments. 🗂️ Requirements: 4–7 years in DevOps, SRE, or Platform Engineering, Experience with infrastructure automation in production environments, Hands-on experience with Terraform or Ansible, Experience building and maintaining CI/CD pipelines, Knowledge of GitOps practices, Understanding of infrastructure security and vulnerability management, Experience with security tools (e.g., Snyk, CrowdStrike), Practical experience with Kubernetes, Experience with GPU technologies (e.g., NVIDIA GPU Operator, MIG), Scripting or programming skills in Python, Go, or Bash, Experience with bare-metal provisioning or low-level infrastructure automation, Knowledge of observability tools (Prometheus, Grafana, Loki, OpenTelemetry) 📃 Skills: Terraform, Ansible, Kubernetes, Python, Go, Bash, Prometheus, Grafana, Loki, OpenTelemetry, Snyk, CrowdStrike, NVIDIA, MIG, CI/CD, GitOps 🏢 Description: About the role We are looking for a DevOps Engineer to help build and operate automation, deployment, and reliability standards for large-scale GPU infrastructure used for AI training and inference workloads. In this role, you will work on software-defined infrastructure supporting GPU clusters, high-performance networking, storage platforms, and internal AI services. This is a hands-on position for someone who is comfortable working close to infrastructure, improving operational processes, and building reliable automation in a complex technical environment. 
Responsibilities Design, implement, and maintain Infrastructure as Code solutions for provisioning and managing bare-metal GPU servers, networking, storage, and cluster orchestration components Build and improve CI/CD pipelines for infrastructure, platform services, and internal tooling Develop and maintain monitoring, logging, alerting, and observability solutions for large-scale GPU environments Support reliability initiatives by defining and tracking SLIs/SLOs , automating incident response, and contributing to post-incident analysis Automate operational tasks such as cluster scaling, firmware and BIOS updates, hardware validation, diagnostics, and capacity planning Work closely with Infrastructure, Networking, Facilities, and AI/ML teams to ensure stable and scalable platform operations Support DevSecOps practices, including infrastructure hardening, vulnerability management, and compliance automation Identify repetitive manual work and replace it with efficient automation Evaluate new tools and solutions related to GPU infrastructure, orchestration, and cloud-native operations Requirements 4–7 years of experience in DevOps, SRE, Platform Engineering , or a similar role Strong practical experience with infrastructure automation in complex production environments Good hands-on knowledge of Terraform, Ansible , or similar Infrastructure as Code tools Experience building and maintaining CI/CD pipelines and working with GitOps practices Good understanding of infrastructure security, vulnerability management, and security best practices Experience with security tools such as Snyk, CrowdStrike , or similar solutions Practical experience with Kubernetes Experience working with GPU-related technologies such as NVIDIA GPU Operator, device plugins, MIG, or time-slicing Good scripting or programming skills in Python, Go, or Bash Experience with bare-metal provisioning, low-level infrastructure automation, or data center operations Good knowledge of observability tools 
such as Prometheus, Grafana, Loki, and OpenTelemetry Ability to work independently, prioritize tasks, and communicate effectively with technical teams English proficiency at least at a communicative level is required, as you will be working in an international team Nice to have Experience in AI infrastructure, HPC environments, hyperscale infrastructure, or data center operations Familiarity with orchestration and scheduling tools such as Slurm, Ray, Run:ai, KServe , or Kubernetes-based schedulers Experience integrating telemetry from power, cooling, or environmental systems Experience building internal platforms or self-service tools for engineering teams Understanding of compliance and audit requirements in security-sensitive environments What we offer Benefits package Opportunity to work on advanced infrastructure supporting large-scale AI workloads Real impact on the reliability and scalability of next-generation compute environments Collaboration with experienced engineers across infrastructure, platform, and AI domains A fast-moving environment with space for ownership, technical input, and professional growth

Technology

emagine Polska

Senior DevOps / SRE (Platform Reliability Engineer) - French fluent

Senior

Remote

Lisbon, Portugal

🏢 Summary: Senior DevOps / SRE role focused on ensuring reliability, scalability, security, and performance of a cloud-native AWS platform. The position centers on infrastructure automation, CI/CD, Kubernetes operations, observability, and implementing SRE best practices to support highly available production systems. You will lead incident management, optimize cloud costs, and drive continuous improvement of platform resilience. 🗂️ Requirements: 5+ years in DevOps/SRE/Cloud/Platform Engineering, Strong Linux administration and troubleshooting, Production experience with Kubernetes, Experience with CI/CD tools, Expertise in Infrastructure as Code, Hands-on experience with AWS, Strong networking fundamentals, Experience with monitoring and logging tools, Scripting skills (Bash or Python) 📃 Skills: AWS, Kubernetes, Docker, Helm, Terraform, Ansible, CloudFormation, Linux, GitLab, Jenkins, GitHub, Azure, Prometheus, Grafana, ELK, Datadog, Splunk, Bash, Python, TCP/IP, DNS 🏢 Description: We are looking for a Senior DevOps / Site Reliability Engineer (SRE) to ensure the reliability, scalability, performance, and security of our platform and cloud infrastructure. You will play a key role in building and operating cloud-native systems, improving observability, automating operations, implementing SRE best practices (SLOs/SLIs), and supporting development teams to deliver highly available services. Key Responsibilities Design, implement, and maintain highly available and scalable infrastructure on AWS. Own and improve the reliability of production systems using SRE principles (SLO, SLI, error budgets). Build and manage CI/CD pipelines to support fast and safe software delivery. Develop and maintain Infrastructure as Code (IaC) using Terraform, Ansible, CloudFormation, etc. Manage and optimize container orchestration platforms (Kubernetes, Docker, Helm). Implement and maintain monitoring, logging, and alerting solutions (Prometheus, Grafana, ELK, Datadog, Splunk). 
Lead incident response, perform root cause analysis, and write postmortems to drive continuous improvement. Improve system performance, capacity planning, scaling strategies, and disaster recovery processes. Collaborate closely with development teams to improve deployment strategies and system resilience. Implement security best practices (IAM, secret management, vulnerability scanning, patching). Define operational standards, runbooks, documentation, and best practices for platform reliability. Participate in on-call rotation and provide senior-level support for critical production issues. Key Responsibilities (5 Main Missions) The DevOps / SRE lead will be responsible for the stability and evolution of the platform. Your role is structured around five main areas: Mission 1: AWS Infrastructure Management (Build & Run) Mission 2: CI/CD and Deployment Automation Mission 3: Monitoring, Observability, and Alerting: Global Monitoring , Log Management , Application Monitoring , Business Analytics Mission 4: Incident Management, Resilience, and Security Mission 5: FinOps and AWS Cost Optimization Key Requirements 5+ years of experience in DevOps / SRE / Cloud Infrastructure / Platform Engineering. Strong expertise in Linux systems administration and troubleshooting. Proven experience with Kubernetes in production environments. Strong experience with CI/CD tools (GitLab CI, Jenkins, GitHub Actions, Azure DevOps). Solid knowledge of Infrastructure as Code (Terraform highly preferred). Experience with AWS cloud platforms. Strong understanding of networking fundamentals (TCP/IP, DNS, load balancing, reverse proxies). Experience with observability tools: monitoring, metrics, logging, tracing. Strong scripting skills (Bash, Python, or similar). French advanced level. Nice to Have Experience with additional cloud platforms (Azure, GCP). Strong understanding of networking fundamentals.

Technology

Link Group

Senior Devops Engineer

Senior

Hybrid

Warsaw, Poland

28,000 - 38,000 PLN

🏢 Summary: Senior DevOps Engineer role focused on owning and evolving cloud-native infrastructure and CI/CD platforms that support large-scale data processing systems. The position combines hands-on engineering and strategic impact to ensure scalable, secure, and reliable production environments. You will design, automate, and optimize platform services enabling efficient delivery of data-driven applications. 🗂️ Requirements: 5+ years in DevOps, SRE, or infrastructure engineering, Experience supporting distributed production systems, Hands-on experience with public cloud platforms, Strong knowledge of containerization and orchestration, Experience with infrastructure as code, Strong scripting or programming skills, Experience building and maintaining CI/CD pipelines, Knowledge of observability practices and tools, Strong troubleshooting and incident response skills in Linux environments 📃 Skills: AWS, Docker, Kubernetes, Terraform, Python, Bash, CI/CD, Linux, Monitoring, Logging, Alerting 🏢 Description: Senior DevOps Engineer We are looking for an experienced engineer to take ownership of our infrastructure and platform ecosystem, supporting large-scale data processing systems and enabling efficient, reliable software delivery. This role combines hands-on engineering with strategic impact — you will design, build, and evolve the platform that underpins data pipelines and production services, ensuring scalability, security, and operational excellence across environments. 
Key Responsibilities Own and evolve CI/CD and automation platforms to support fast and reliable delivery of data-driven applications Design and manage cloud-native infrastructure supporting high-volume data ingestion, processing, and serving Build and maintain infrastructure as code to ensure consistency and scalability across environments Manage containerized environments and orchestration platforms to deliver resilient and scalable services Implement observability solutions (monitoring, logging, alerting) to ensure full system visibility and reliability Automate deployment processes, configuration management, and system recovery workflows Collaborate with engineering, data, and compliance teams to deliver secure and production-ready solutions Drive incident management practices and continuous improvement initiatives Contribute to platform strategy, tooling decisions, and mentoring within the team Requirements 5+ years of experience in DevOps, SRE, or infrastructure engineering roles Strong experience supporting production systems in distributed environments Hands-on experience with public cloud platforms (AWS or similar) Solid knowledge of containerization and orchestration technologies (Docker, Kubernetes) Experience with infrastructure as code tools (e.g., Terraform) Strong scripting/programming skills (Python, Bash, or similar) Experience building and maintaining CI/CD pipelines and automation tooling Knowledge of observability practices and tools Strong troubleshooting and incident response skills in Linux environments Excellent communication skills and ability to work cross-functionally Nice to Have Experience working with large-scale data platforms Exposure to regulated environments or compliance requirements Experience contributing to platform or engineering standards

Technology

Link Group

Senior Devops Engineer

Senior

Hybrid

Warsaw, Poland

28,000 - 38,000 PLN

🏢 Summary: Senior DevOps Engineer role focused on owning and evolving cloud-native infrastructure and CI/CD platforms supporting large-scale data processing systems. The position combines hands-on engineering with strategic platform development to ensure scalable, secure, and reliable production environments. You will design, automate, and maintain infrastructure and observability solutions across distributed systems. 🗂️ Requirements: 5+ years in DevOps, SRE, or infrastructure engineering, Experience supporting production systems in distributed environments, Hands-on experience with public cloud platforms (AWS or similar), Strong knowledge of Docker and Kubernetes, Experience with infrastructure as code tools (Terraform), Strong scripting/programming skills (Python or Bash), Experience building and maintaining CI/CD pipelines, Knowledge of observability, monitoring, and logging tools, Strong troubleshooting and incident response skills in Linux environments 📃 Skills: AWS, Docker, Kubernetes, Terraform, Python, Bash, Linux, CICD, Observability, Automation, Infrastructure, Cloud 🏢 Description: Senior DevOps Engineer We are looking for an experienced engineer to take ownership of our infrastructure and platform ecosystem, supporting large-scale data processing systems and enabling efficient, reliable software delivery. This role combines hands-on engineering with strategic impact — you will design, build, and evolve the platform that underpins data pipelines and production services, ensuring scalability, security, and operational excellence across environments. 
Key Responsibilities Own and evolve CI/CD and automation platforms to support fast and reliable delivery of data-driven applications Design and manage cloud-native infrastructure supporting high-volume data ingestion, processing, and serving Build and maintain infrastructure as code to ensure consistency and scalability across environments Manage containerized environments and orchestration platforms to deliver resilient and scalable services Implement observability solutions (monitoring, logging, alerting) to ensure full system visibility and reliability Automate deployment processes, configuration management, and system recovery workflows Collaborate with engineering, data, and compliance teams to deliver secure and production-ready solutions Drive incident management practices and continuous improvement initiatives Contribute to platform strategy, tooling decisions, and mentoring within the team Requirements 5+ years of experience in DevOps, SRE, or infrastructure engineering roles Strong experience supporting production systems in distributed environments Hands-on experience with public cloud platforms (AWS or similar) Solid knowledge of containerization and orchestration technologies (Docker, Kubernetes) Experience with infrastructure as code tools (e.g., Terraform) Strong scripting/programming skills (Python, Bash, or similar) Experience building and maintaining CI/CD pipelines and automation tooling Knowledge of observability practices and tools Strong troubleshooting and incident response skills in Linux environments Excellent communication skills and ability to work cross-functionally Nice to Have Experience working with large-scale data platforms Exposure to regulated environments or compliance requirements Experience contributing to platform or engineering standards

Technology

emagine Polska

AI DevOps Lead Engineer

Senior

Hybrid

Krakow, Poland

200 - 200 PLN/hr

🏢 Summary: Long-term B2B opportunity for an AI DevOps Lead Engineer to design and manage cloud infrastructure on Azure and GCP, focusing on IaC, CI/CD, and Kubernetes clusters. The role centers on building secure, scalable environments, automating processes, and enhancing monitoring and security frameworks. Hybrid work model with strong emphasis on cloud-native and DevOps best practices. 🗂️ Requirements: 4+ years in DevOps, SRE, or Cloud Engineering, Strong knowledge of Linux/UNIX, Production experience with GCP and/or Azure, Experience managing GKE and/or AKS clusters, Proficiency in Terraform for Infrastructure as Code, Experience building CI/CD pipelines with Jenkins, Scripting skills in Python and/or Bash 📃 Skills: GCP, Azure, GKE, AKS, Terraform, Jenkins, Python, Bash, Linux, Kubernetes, CI/CD, IaC 🏢 Description: Introduction & Summary: We are seeking an accomplished AI DevOps Lead Engineer with a strong foundation in cloud infrastructure management and DevOps practices. The ideal candidate will have over 4 years of experience in DevOps, SRE, or Cloud Engineering roles, with a proven record of designing and implementing robust cloud solutions, particularly on GCP and Azure platforms. Strong expertise in Infrastructure as Code (IaC) and CI/CD pipeline management is imperative, along with scripting capabilities in Python and Bash. What we offer: Long Term B2B Contract Rate: 200 PLN/ H +VAT Hybrid Cracow ( 1-2 times per week) Main Responsibilities: Design, build, and manage core infrastructure on Azure using IaC principles. Administer and enhance GKE and AKS clusters, ensuring security, scalability, and resilience. Evolve Jenkins pipelines and developer tooling for improved automation and efficiency. Implement security controls and automate vulnerabilities remediation. Develop a robust monitoring and alerting framework for operational visibility. Implement security gateways for improved governance. 
Key Requirements: 4+ years of experience in a DevOps, SRE, or Cloud Engineering role. Deep knowledge of Linux/UNIX operating systems. Production experience with GCP and/or Azure, including management of clusters (GKE, AKS). Strong proficiency with Infrastructure as Code using Terraform. Proven experience with CI/CD pipeline development using Jenkins. Scripting and automation skills in Python and/or Bash. Nice to Have: Experience with monitoring tools like Prometheus, Grafana, and Loki. Familiarity with container management using Docker. Knowledge of additional cloud services and platforms. Other Details: This position offers the opportunity to work in a dynamic environment that fosters innovation and collaboration. Candidates can work remotely or on-site, depending on personal preferences; flexibility in schedule is provided to facilitate a productive work-life balance.

Technology

ITMAGINATION

AI Lead DevOps Engineer

Senior

Remote

Warsaw, Poland

25,575 - 29,450 PLN

🏢 Summary: Remote AI Lead DevOps Engineer role responsible for defining and executing the MLOps and CI/CD strategy for enterprise AI platforms. The position focuses on architecting secure, compliant, and fully automated ML lifecycle governance, ensuring auditability, reproducibility, and large-scale reliability. The role combines technical leadership with hands-on design of cloud-native, DevSecOps-driven AI infrastructure. 🗂️ Requirements: 8–10 years DevOps or Cloud Engineering experience, Minimum 3 years in technical leadership or architect role, Strong knowledge of end-to-end ML lifecycle, Expertise in CI/CD pipeline design and implementation, Advanced Infrastructure as Code experience, Experience with SAST and DAST implementation, Strong IAM and access control management in cloud, Ability to design observability frameworks for ML systems, Experience with configuration management in multi-cloud environments, Knowledge of database scaling and security, Experience implementing model governance and auditability practices 📃 Skills: MLOps, CI/CD, DevSecOps, Azure, AzureDevOps, GitHubActions, Jenkins, Terraform, CloudFormation, SAST, DAST, IAM, Ansible, Puppet, MySQL, PostgreSQL, MongoDB, Observability, Git 🏢 Description: This is a remote position. We are looking for an AI Lead DevOps Engineer to spearhead the MLOps strategy for our high-impact AI accounts. With 8–10 years of experience, you will provide the technical leadership necessary to design robust, compliant, and highly automated AI platforms. You aren't just managing pipelines; you are architect the entire lifecycle governance—ensuring reproducibility, audibility, and security at an enterprise scale. Key Responsibilities: Strategic Leadership: Provide technical direction for the DevOps squad, defining the CI/CD and MLOps roadmap for the account. Model Governance & Evaluation: Implement automated model evaluation pipelines to track accuracy, precision, and recall metrics in production. 
Enterprise Security: Lead the DevSecOps strategy, ensuring all AI deployments comply with enterprise security standards and global data regulations. Platform Enablement: Architect self-service platforms that allow ML engineers to deploy models with minimal friction while maintaining strict governance guardrails. Auditability & Reproducibility: Ensure that every ML experiment is fully auditable through sophisticated pipeline and dataset versioning strategies. Mentorship: Mentor senior and junior engineers, driving best practices in automation, IaC, and cloud-native architecture. Requirements 8–10 years of experience in DevOps/Cloud Engineering, with at least 3 years in a technical leadership or architect-level role. Deep understanding of the end-to-end ML lifecycle (training, validation, deployment, and retraining loops). Mastery across Azure DevOps, GitHub Actions, and Jenkins. Expert-level Terraform or CloudFormation skills, including modular architecture and cross-account cloud deployments. Significant experience implementing SAST/DAST tools and managing complex IAM/Access Control frameworks in a cloud environment. Ability to design custom observability frameworks that track model drift, pipeline failures, and infrastructure ROI. Advanced knowledge of configuration management tools like Ansible or Puppet for complex multi-cloud environments. Solid understanding of database scaling and security for MySQL, PostgreSQL, and MongoDB. Understanding of how DevOps practices support responsible AI (e.g., bias tracking and audit logs). Exceptional ability to collaborate with Architects and Data Scientists to translate high-level AI needs into operational reality. Native or C1-level English, with the ability to present technical strategies to senior stakeholders. Benefits Professional training programs Work with a team that’s recognized for its excellence. We’ve been featured in the Deloitte Technology Fast 50 & FT 1000 rankings. 
We’ve also received the Great Place To Work® certification for five years in a row

Technology

ALTER GPU CENTER

Lead Linux System Administrator

Senior

Remote

Łódź, Poland

🏢 Summary: Lead Linux System Administrator role focused on owning and optimizing large-scale Linux-based GPU infrastructure for AI training and inference. Combines hands-on administration of NVIDIA GPU environments with team leadership and automation in a high-availability production setting. Responsible for performance, security, reliability, and lifecycle management of GPU servers. 🗂️ Requirements: 7+ years Linux system administration in production, 3+ years in technical lead or team leadership role, Expertise in Linux administration at scale, Hands-on experience with NVIDIA GPUs in Linux, Experience with CUDA ecosystem components, Experience with Ansible or other configuration management tools, Scripting skills in Python and/or Bash, Experience with Infrastructure as Code, Knowledge of high-performance computing environments, Experience with high-speed networking (InfiniBand or RoCE), Experience supporting AI/ML or HPC workloads, Ability to troubleshoot complex production issues, English proficiency (communicative level) 📃 Skills: Linux, NVIDIA, CUDA, cuDNN, NCCL, DCGM, nvidia-smi, MIG, Ansible, Terraform, Python, Bash, InfiniBand, RoCE, HPC, Slurm, Kubernetes, Run:ai 🏢 Description: About the role We are looking for a Lead Linux System Administrator to take technical ownership of the Linux environment supporting large-scale GPU infrastructure used for AI training and inference workloads. This role combines hands-on system administration with team leadership. You will be responsible for the stability, performance, security, and day-to-day management of Linux-based GPU servers, while also supporting and mentoring a team of administrators working in a complex production environment. 
Responsibilities Lead, mentor, and support a team of Linux System Administrators responsible for GPU infrastructure operations Manage the full Linux server lifecycle, including provisioning, patching, configuration management, hardening, and performance tuning Maintain and optimize the NVIDIA GPU software stack , including drivers, CUDA, cuDNN, NCCL, and GPU management tools such as DCGM and nvidia-smi Support and manage MIG and GPU time-slicing configurations where needed Develop and maintain automation for bare-metal provisioning, OS image management, and server configuration using tools such as Ansible, Terraform , and scripting Tune Linux systems for demanding workloads, including kernel parameters, local storage, parallel file systems, networking, and scheduler settings Troubleshoot complex issues across hardware, drivers, the operating system, and cluster-level services Work closely with DevOps/SRE, Site Operations, and AI/ML teams to ensure smooth integration between OS-level infrastructure and higher-level orchestration platforms Support security hardening, vulnerability management, patch compliance, and operational standards across the server fleet Participate in on-call support and contribute to continuous improvements in reliability, performance, and operational efficiency Requirements 7+ years of hands-on experience in Linux system administration in production environments At least 3 years of experience in a technical lead, lead administrator, or people leadership role Strong expertise in administering Linux systems at scale Hands-on experience with NVIDIA GPUs in Linux environments , including drivers, CUDA ecosystem components, and GPU management tools Strong experience with Ansible or other configuration management tools Good scripting skills in Python and/or Bash Experience with Infrastructure as Code and infrastructure automation Good understanding of high-performance computing , storage systems, and high-speed networking technologies such as 
InfiniBand or RoCE Experience supporting AI/ML or HPC workloads Ability to troubleshoot complex production issues and work effectively in a high-availability environment English proficiency at least at a communicative level is required, as you will be working in an international team Nice to have Experience with cluster management and orchestration tools such as Slurm, Kubernetes, or Run:ai Familiarity with bare-metal provisioning tools and large server fleet management Experience in AI infrastructure companies, hyperscalers, or HPC/research environments Knowledge of Linux performance tuning for GPU-accelerated workloads Higher education in Computer Science, Engineering, or a related field What we offer Benefits package Opportunity to lead Linux infrastructure supporting advanced AI workloads at scale Work with modern GPU hardware and software stacks in a technically demanding environment Collaboration with experienced engineers across infrastructure, platform, and AI teams A dynamic workplace with room for ownership, technical influence, and professional growth

Technology

Link Group

Site Reliability Engineer

Senior

Remote

Warsaw, Poland

21,000 - 24,000 PLN

🏢 Summary: Senior Site Reliability Engineer responsible for end-to-end reliability of AI-driven applications and pipelines in production environments. Hands-on role focused on diagnosing, resolving, and automating production issues while improving monitoring and CI/CD processes. Ensures high performance, reliability, and standardized telemetry across AI systems.

🗂️ Requirements:

  • 5+ years experience as SRE, Production Engineer, or Platform Engineer

  • Strong incident management and root cause analysis experience

  • Hands-on experience with Azure DevOps

  • Hands-on experience with Kubernetes

  • Hands-on experience with Datadog

  • Hands-on experience with Azure

  • Hands-on experience with CI/CD pipelines

  • Experience working in production environments

  • Ability to build and maintain monitoring and alerting systems

📃 Skills: Azure, Kubernetes, Datadog, AzureDevOps, CICD, Grafana, AI, LLM, Monitoring, Telemetry, RCA

🏢 Description:

About the Role

We are looking for a Senior Site Reliability Engineer who will take end-to-end ownership of reliability for AI-driven applications and pipelines. This is a hands-on engineering role, not a coordination or ticket-driven position. The ideal candidate actively diagnoses, resolves, and automates production issues rather than only designing solutions.

Requirements

  • 5+ years as SRE / Production / Platform Engineer

  • Strong incident management & RCA experience

  • Hands-on with: Azure DevOps, Kubernetes, Datadog, Azure, CI/CD

  • Proactive, ownership mindset, self-driven

  • Experience in production environments

  • Nice to have: AI/LLM pipelines, Grafana

Responsibilities

  • Build and maintain monitoring, alerting, dashboards

  • Lead incident response & root cause analysis

  • Ensure reliability and performance of AI pipelines

  • Standardize telemetry (latency, failures, throughput)

  • Optimize CI/CD and release quality

  • Reduce recurring incidents with engineering teams

Technology

Link Group

DevOps / Site Reliability Engineer

Mid

Hybrid

Kraków, Poland

20,000 - 25,000 PLN

🏢 Summary: DevOps / Site Reliability Engineer role focused on building and maintaining scalable cloud infrastructure while improving platform reliability and automation. The position centers on Kubernetes-based environments, CI/CD pipeline development, and enhancing monitoring and observability. The engineer will support development teams through infrastructure as code and internal developer platform initiatives.

🗂️ Requirements:

  • Experience with cloud platforms (Azure preferred)

  • Strong experience with Kubernetes

  • Strong knowledge of Infrastructure as Code (Terraform)

  • Hands-on experience with CI/CD tools

  • Experience with monitoring and observability tools

  • Understanding of scalability, reliability, and security best practices

📃 Skills: Azure, Kubernetes, Terraform, GitHubActions, ArgoCD, CI/CD, Datadog, Prometheus, Grafana, MongoDB, Rancher, Jenkins, PowerBI, Jira, Confluence

🏢 Description:

DevOps / Site Reliability Engineer

We’re looking for a DevOps / SRE to help build and maintain scalable cloud infrastructure and improve reliability across our platform. You’ll focus on automation, CI/CD, and supporting development teams with efficient tooling and processes.

Key responsibilities

  • Develop and manage cloud infrastructure (Azure preferred)

  • Work with Kubernetes and containerized environments

  • Build and maintain CI/CD pipelines (GitHub Actions, ArgoCD)

  • Automate deployments and operational processes

  • Contribute to Internal Developer Platform (IDP) development

  • Improve monitoring and observability (e.g., Datadog, Prometheus, Grafana)

Requirements

  • Experience with cloud platforms and Kubernetes

  • Strong knowledge of Infrastructure as Code (e.g., Terraform)

  • Hands-on experience with CI/CD tools

  • Understanding of scalability, reliability, and security best practices

  • Experience with monitoring/observability tools

Nice to have

  • Experience with MongoDB Atlas, Rancher, Jenkins, Power BI

  • Familiarity with Jira, Confluence

Technology

emagine Polska

AI DevOps Engineer Lead

Senior

Hybrid

Krakow, Poland

190 - 200 PLN/hr

🏢 Summary: The offer is for an AI DevOps Lead Engineer responsible for designing and managing cloud infrastructure on Azure and GCP with a strong focus on Infrastructure as Code and CI/CD automation. The role involves leading infrastructure initiatives, administering Kubernetes clusters, and implementing secure, scalable, and resilient cloud solutions. It is a long-term B2B contract with a hybrid work model in Cracow.

🗂️ Requirements:

  • 4+ years of experience in DevOps, SRE, or Cloud Engineering

  • Strong knowledge of Linux/UNIX systems

  • Production experience with GCP and/or Azure

  • Experience managing GKE and/or AKS clusters

  • Proficiency in Infrastructure as Code with Terraform

  • Experience building and maintaining CI/CD pipelines with Jenkins

  • Scripting skills in Python and/or Bash

📃 Skills: GCP, Azure, GKE, AKS, Terraform, Jenkins, Python, Bash, Linux, UNIX, CI/CD, IaC

🏢 Description:

We are seeking an accomplished AI DevOps Lead Engineer with a strong foundation in cloud infrastructure management and DevOps practices. The ideal candidate will have over 4 years of experience in DevOps, SRE, or Cloud Engineering roles, with a proven record of designing and implementing robust cloud solutions, particularly on GCP and Azure platforms. Strong expertise in Infrastructure as Code (IaC) and CI/CD pipeline management is imperative, along with scripting capabilities in Python and Bash.

What we offer:

  • Long-term B2B contract

  • Rate: 200 PLN/h + VAT

  • Hybrid: Cracow (1-2 times per week)

Main Responsibilities:

  • Design, build, and manage core infrastructure on Azure using IaC principles.

  • Administer and enhance GKE and AKS clusters, ensuring security, scalability, and resilience.

  • Evolve Jenkins pipelines and developer tooling for improved automation and efficiency.

  • Implement security controls and automate vulnerability remediation.

  • Develop a robust monitoring and alerting framework for operational visibility.

  • Implement security gateways for improved governance.

Key Requirements:

  • 4+ years of experience in a DevOps, SRE, or Cloud Engineering role.

  • Deep knowledge of Linux/UNIX operating systems.

  • Production experience with GCP and/or Azure, including management of clusters (GKE, AKS).

  • Strong proficiency with Infrastructure as Code using Terraform.

  • Proven experience with CI/CD pipeline development using Jenkins.

  • Scripting and automation skills in Python and/or Bash.

Nice to Have:

  • Experience with monitoring tools like Prometheus, Grafana, and Loki.

  • Familiarity with container management using Docker.

  • Knowledge of additional cloud services and platforms.