June 8, 2026
Linux System Administrator
Mid • Remote
Łódź, Poland
About the role
We are looking for a Linux System Administrator to support the Linux environment behind large-scale GPU infrastructure used for AI training and inference workloads.
This is a hands-on role focused on the deployment, maintenance, performance tuning, and reliability of Linux-based GPU servers. You will work closely with infrastructure and platform teams to keep the environment stable, secure, and ready for demanding production workloads.
Responsibilities
Install, configure, patch, and maintain Linux operating systems across GPU-based server environments
Manage and support the NVIDIA GPU software stack, including drivers, CUDA, cuDNN, NCCL, DCGM, and MIG/time-slicing configurations
Perform system performance tuning, kernel optimization, storage configuration, and networking setup for AI/HPC workloads
Develop and maintain automation scripts and operational tooling using Python, Bash, or similar technologies
Monitor system health, investigate alerts, and troubleshoot issues across hardware, drivers, operating systems, and cluster services
Support bare-metal provisioning and integration with orchestration platforms such as Slurm or Kubernetes
Work closely with Site Operations, DevOps/SRE, and AI/ML teams to support stable GPU cluster operations and infrastructure growth
Participate in on-call support, incident response, root cause analysis, and post-incident improvement activities
Support security hardening, patch compliance, vulnerability management, and operational standards across the server fleet
Requirements
4–8 years of hands-on experience in Linux system administration in production environments
Good knowledge of enterprise Linux distributions such as Ubuntu, Debian, Red Hat Enterprise Linux, or Rocky Linux
Experience with Linux administration at scale
Practical experience with configuration management, scripting, and infrastructure automation
Good scripting skills in Python and/or Bash
Good understanding of performance tuning, storage systems, and high-speed networking technologies such as RDMA, InfiniBand, or RoCE
Experience working with NVIDIA GPUs in Linux environments, including drivers, CUDA components, and GPU monitoring tools, will be a strong advantage
Ability to troubleshoot complex technical issues in production environments
English at a communicative level or higher is required, as you will be working in an international team
Nice to have
Experience in AI/ML, HPC, or large-scale data center environments
Experience with bare-metal provisioning and fleet management
Familiarity with Slurm, Kubernetes, or similar orchestration tools
Knowledge of observability tools such as Prometheus and Grafana
Familiarity with DCIM platforms
Higher education in Computer Science, Engineering, or a related field
What we offer
Benefits package
Opportunity to work on Linux infrastructure supporting advanced AI workloads
Exposure to modern GPU hardware and high-performance computing technologies
Collaboration with experienced engineers across infrastructure, platform, and AI teams
A dynamic environment with room for ownership, learning, and professional growth
Similar jobs you might like
Technology
ALTER GPU CENTER
Lead Linux System Administrator
Senior
Remote
Łódź, Poland
🏢 Summary: Lead Linux System Administrator role focused on owning and optimizing large-scale Linux-based GPU infrastructure for AI training and inference. Combines hands-on administration of NVIDIA GPU environments with team leadership and automation in a high-availability production setting. Responsible for performance, security, reliability, and lifecycle management of GPU servers. 🗂️ Requirements: 7+ years Linux system administration in production, 3+ years in technical lead or team leadership role, Expertise in Linux administration at scale, Hands-on experience with NVIDIA GPUs in Linux, Experience with CUDA ecosystem components, Experience with Ansible or other configuration management tools, Scripting skills in Python and/or Bash, Experience with Infrastructure as Code, Knowledge of high-performance computing environments, Experience with high-speed networking (InfiniBand or RoCE), Experience supporting AI/ML or HPC workloads, Ability to troubleshoot complex production issues, English proficiency (communicative level) 📃 Skills: Linux, NVIDIA, CUDA, cuDNN, NCCL, DCGM, nvidia-smi, MIG, Ansible, Terraform, Python, Bash, InfiniBand, RoCE, HPC, Slurm, Kubernetes, Run:ai 🏢 Description: About the role We are looking for a Lead Linux System Administrator to take technical ownership of the Linux environment supporting large-scale GPU infrastructure used for AI training and inference workloads. This role combines hands-on system administration with team leadership. You will be responsible for the stability, performance, security, and day-to-day management of Linux-based GPU servers, while also supporting and mentoring a team of administrators working in a complex production environment. 
Responsibilities Lead, mentor, and support a team of Linux System Administrators responsible for GPU infrastructure operations Manage the full Linux server lifecycle, including provisioning, patching, configuration management, hardening, and performance tuning Maintain and optimize the NVIDIA GPU software stack , including drivers, CUDA, cuDNN, NCCL, and GPU management tools such as DCGM and nvidia-smi Support and manage MIG and GPU time-slicing configurations where needed Develop and maintain automation for bare-metal provisioning, OS image management, and server configuration using tools such as Ansible, Terraform , and scripting Tune Linux systems for demanding workloads, including kernel parameters, local storage, parallel file systems, networking, and scheduler settings Troubleshoot complex issues across hardware, drivers, the operating system, and cluster-level services Work closely with DevOps/SRE, Site Operations, and AI/ML teams to ensure smooth integration between OS-level infrastructure and higher-level orchestration platforms Support security hardening, vulnerability management, patch compliance, and operational standards across the server fleet Participate in on-call support and contribute to continuous improvements in reliability, performance, and operational efficiency Requirements 7+ years of hands-on experience in Linux system administration in production environments At least 3 years of experience in a technical lead, lead administrator, or people leadership role Strong expertise in administering Linux systems at scale Hands-on experience with NVIDIA GPUs in Linux environments , including drivers, CUDA ecosystem components, and GPU management tools Strong experience with Ansible or other configuration management tools Good scripting skills in Python and/or Bash Experience with Infrastructure as Code and infrastructure automation Good understanding of high-performance computing , storage systems, and high-speed networking technologies such as 
InfiniBand or RoCE Experience supporting AI/ML or HPC workloads Ability to troubleshoot complex production issues and work effectively in a high-availability environment English proficiency at least at a communicative level is required, as you will be working in an international team Nice to have Experience with cluster management and orchestration tools such as Slurm, Kubernetes, or Run:ai Familiarity with bare-metal provisioning tools and large server fleet management Experience in AI infrastructure companies, hyperscalers, or HPC/research environments Knowledge of Linux performance tuning for GPU-accelerated workloads Higher education in Computer Science, Engineering, or a related field What we offer Benefits package Opportunity to lead Linux infrastructure supporting advanced AI workloads at scale Work with modern GPU hardware and software stacks in a technically demanding environment Collaboration with experienced engineers across infrastructure, platform, and AI teams A dynamic workplace with room for ownership, technical influence, and professional growth
Technology
ALTER GPU CENTER
DevOps Engineer
Mid
Remote
Łódź, Poland
🏢 Summary: Hands-on DevOps Engineer role focused on building and operating automation, deployment, and reliability standards for large-scale GPU infrastructure supporting AI training and inference. The position involves Infrastructure as Code, CI/CD, observability, security, and low-level automation across bare-metal servers, networking, storage, and Kubernetes-based platforms. The role emphasizes reliability, scalability, and automation in complex, high-performance environments. 🗂️ Requirements: 4–7 years in DevOps, SRE, or Platform Engineering, Experience with infrastructure automation in production environments, Hands-on experience with Terraform or Ansible, Experience building and maintaining CI/CD pipelines, Knowledge of GitOps practices, Understanding of infrastructure security and vulnerability management, Experience with security tools (e.g., Snyk, CrowdStrike), Practical experience with Kubernetes, Experience with GPU technologies (e.g., NVIDIA GPU Operator, MIG), Scripting or programming skills in Python, Go, or Bash, Experience with bare-metal provisioning or low-level infrastructure automation, Knowledge of observability tools (Prometheus, Grafana, Loki, OpenTelemetry) 📃 Skills: Terraform, Ansible, Kubernetes, Python, Go, Bash, Prometheus, Grafana, Loki, OpenTelemetry, Snyk, CrowdStrike, NVIDIA, MIG, CI/CD, GitOps 🏢 Description: About the role We are looking for a DevOps Engineer to help build and operate automation, deployment, and reliability standards for large-scale GPU infrastructure used for AI training and inference workloads. In this role, you will work on software-defined infrastructure supporting GPU clusters, high-performance networking, storage platforms, and internal AI services. This is a hands-on position for someone who is comfortable working close to infrastructure, improving operational processes, and building reliable automation in a complex technical environment. 
Responsibilities Design, implement, and maintain Infrastructure as Code solutions for provisioning and managing bare-metal GPU servers, networking, storage, and cluster orchestration components Build and improve CI/CD pipelines for infrastructure, platform services, and internal tooling Develop and maintain monitoring, logging, alerting, and observability solutions for large-scale GPU environments Support reliability initiatives by defining and tracking SLIs/SLOs , automating incident response, and contributing to post-incident analysis Automate operational tasks such as cluster scaling, firmware and BIOS updates, hardware validation, diagnostics, and capacity planning Work closely with Infrastructure, Networking, Facilities, and AI/ML teams to ensure stable and scalable platform operations Support DevSecOps practices, including infrastructure hardening, vulnerability management, and compliance automation Identify repetitive manual work and replace it with efficient automation Evaluate new tools and solutions related to GPU infrastructure, orchestration, and cloud-native operations Requirements 4–7 years of experience in DevOps, SRE, Platform Engineering , or a similar role Strong practical experience with infrastructure automation in complex production environments Good hands-on knowledge of Terraform, Ansible , or similar Infrastructure as Code tools Experience building and maintaining CI/CD pipelines and working with GitOps practices Good understanding of infrastructure security, vulnerability management, and security best practices Experience with security tools such as Snyk, CrowdStrike , or similar solutions Practical experience with Kubernetes Experience working with GPU-related technologies such as NVIDIA GPU Operator, device plugins, MIG, or time-slicing Good scripting or programming skills in Python, Go, or Bash Experience with bare-metal provisioning, low-level infrastructure automation, or data center operations Good knowledge of observability tools 
such as Prometheus, Grafana, Loki, and OpenTelemetry Ability to work independently, prioritize tasks, and communicate effectively with technical teams English proficiency at least at a communicative level is required, as you will be working in an international team Nice to have Experience in AI infrastructure, HPC environments, hyperscale infrastructure, or data center operations Familiarity with orchestration and scheduling tools such as Slurm, Ray, Run:ai, KServe , or Kubernetes-based schedulers Experience integrating telemetry from power, cooling, or environmental systems Experience building internal platforms or self-service tools for engineering teams Understanding of compliance and audit requirements in security-sensitive environments What we offer Benefits package Opportunity to work on advanced infrastructure supporting large-scale AI workloads Real impact on the reliability and scalability of next-generation compute environments Collaboration with experienced engineers across infrastructure, platform, and AI domains A fast-moving environment with space for ownership, technical input, and professional growth
Technology
ALTER GPU CENTER
Lead DevOps Engineer
Senior
Remote
Łódź, Poland
🏢 Summary: Technical leadership role combining hands-on DevOps/SRE engineering with team management to build and operate large-scale GPU infrastructure for AI workloads. Focused on infrastructure automation, reliability, observability, and high-performance networking across complex production environments. Responsible for shaping IaC standards, CI/CD, and operational excellence for software-defined, GPU-based platforms. 🗂️ Requirements: 8+ years in DevOps, SRE, or Platform Engineering, 3+ years in technical leadership role, Experience with large-scale infrastructure automation, Proficiency in Infrastructure as Code tools, Experience with GitOps and CI/CD, Hands-on experience with Kubernetes, Experience with GPU technologies, Scripting or programming in Python, Go, or Bash, Experience with bare-metal provisioning, Knowledge of observability and monitoring tools, Understanding of distributed systems reliability, Experience with high-performance networking technologies, Ability to lead technical discussions and mentor engineers, English proficiency at communicative level 📃 Skills: Terraform, Ansible, Pulumi, Crossplane, GitOps, Kubernetes, NVIDIA, MIG, Python, Go, Bash, Prometheus, Grafana, Loki, OpenTelemetry, RDMA, InfiniBand, RoCE, CI/CD 🏢 Description: About the role We are looking for a Lead DevOps Engineer to provide technical leadership for DevOps and Site Reliability Engineering practices supporting large-scale GPU infrastructure used for AI training and inference workloads. This role combines hands-on engineering with team leadership. You will be responsible for shaping automation standards, improving platform reliability, and leading a team working on software-defined infrastructure, high-performance networking, observability, and operational excellence across complex production environments. 
Responsibilities Lead, mentor, and support a team of DevOps and SRE engineers working across the full lifecycle of GPU infrastructure platforms Design and implement Infrastructure as Code solutions for provisioning and managing bare-metal GPU servers, networking, storage, and cluster orchestration components Build and improve CI/CD pipelines for infrastructure, platform services, and internal tooling Develop and maintain monitoring, logging, alerting, and observability solutions for large-scale GPU environments Define and track SLIs/SLOs , improve incident response processes, and contribute to post-incident reviews and long-term reliability improvements Work closely with Infrastructure, Networking, Facilities, and AI/ML teams to ensure stable and scalable platform operations Automate operational processes such as cluster scaling, firmware and BIOS updates, hardware diagnostics, and capacity planning Support DevSecOps practices, including infrastructure hardening, vulnerability management, and compliance automation Identify operational inefficiencies and reduce repetitive manual work through automation Evaluate and introduce new tools and solutions related to GPU infrastructure, orchestration, and cloud-native operations Requirements 8+ years of experience in DevOps, SRE, Platform Engineering , or a similar area At least 3 years of experience in a technical lead, lead engineer, or team leadership role Strong practical experience with infrastructure automation in large-scale or complex production environments Very good knowledge of Terraform, Ansible, Pulumi, Crossplane , or similar Infrastructure as Code tools Experience with GitOps , configuration management, and CI/CD practices Hands-on experience with Kubernetes Experience working with GPU-related technologies such as NVIDIA GPU Operator, device plugins, MIG, or time-slicing Good scripting or programming skills in Python, Go, or Bash Experience with bare-metal provisioning, infrastructure automation, or data 
center environments Good knowledge of observability tools such as Prometheus, Grafana, Loki, and OpenTelemetry Good understanding of distributed systems reliability and production incident management Experience with high-performance networking technologies such as RDMA, InfiniBand, or RoCE will be a strong advantage Ability to lead technical discussions, support team development, and communicate effectively with both technical and business stakeholders English proficiency at least at a communicative level is required, as you will be working in an international team Nice to have Experience in AI infrastructure, HPC environments, hyperscale infrastructure, or data center operations Familiarity with orchestration and scheduling tools such as Slurm, Ray, Run:ai, KServe , or Kubernetes-based schedulers Experience integrating telemetry from power, cooling, or environmental systems Experience building internal platforms or self-service tools for engineering or research teams Understanding of security, compliance, and audit requirements in regulated or security-sensitive environments What we offer Benefits package Opportunity to shape the DevOps and SRE foundation for advanced GPU infrastructure supporting AI workloads Real impact on the scalability, reliability, and operational standards of next-generation compute environments Collaboration with experienced engineers across infrastructure, platform, and AI domains A dynamic environment with space for ownership, technical leadership, and professional growth
Technology
ALTER GPU CENTER
Junior DevOps Engineer
Junior
Remote
Łódź, Poland
🏢 Summary: Junior DevOps Engineer role focused on supporting and automating cloud infrastructure for AI training and inference workloads. The position involves working with CI/CD pipelines, containerization, Infrastructure as Code, and monitoring systems in a hands-on learning environment. It offers growth in DevOps, platform engineering, and cloud operations. 🗂️ Requirements: 0–2 years in DevOps, IT operations, infrastructure, or system administration, Knowledge of Linux and terminal usage, Knowledge of Python for automation and scripting, Ability to write automation scripts, Practical knowledge of AWS and basic cloud services, Understanding of cloud infrastructure concepts, Knowledge of Docker and containerization, Basic understanding of CI/CD concepts, Ability to use Git, Basic understanding of networking, security, logging, and monitoring, Interest in Infrastructure as Code tools (Terraform or Ansible), Basic understanding of Kubernetes, Communicative English 📃 Skills: Linux, Python, AWS, Docker, CI/CD, Git, Terraform, Ansible, Kubernetes, Bash, Prometheus, Grafana, Loki, OpenTelemetry, GitHub, GitLab, Jenkins 🏢 Description: About the role We are looking for a Junior DevOps Engineer to support the development, automation, and maintenance of infrastructure used for AI training and inference workloads. In this role, you will work with experienced engineers on cloud and infrastructure automation, CI/CD pipelines, application environments, monitoring, and operational support. This is a hands-on position for someone who wants to grow in DevOps, platform engineering, cloud infrastructure, and modern operations. 
Responsibilities Support the maintenance and development of cloud and infrastructure environments Help prepare, maintain, and troubleshoot application environments Automate repetitive tasks using Python, Bash, and scripts Support the creation and maintenance of CI/CD pipelines Assist with Infrastructure as Code solutions for servers, networking, storage, and cluster components Monitor systems, analyze logs, and help troubleshoot technical issues Work with development, infrastructure, networking, and AI/ML teams on application deployment and platform operations Support the stability, security, and reliability of infrastructure Help identify manual processes that can be automated Document technical solutions, runbooks, and operational processes Requirements 0–2 years of experience in DevOps, IT operations, infrastructure, system administration, or a similar area Good knowledge of Linux and working in the terminal Good knowledge of Python for automation, scripts, and simple internal tools Ability to write clear scripts automating repetitive tasks Practical knowledge of AWS and basic cloud services Understanding of cloud environments and basic cloud infrastructure concepts Good knowledge of Docker and application containerization Basic understanding of CI/CD concepts Ability to work with Git Basic understanding of networking, security, logs, and monitoring Interest in Infrastructure as Code tools such as Terraform or Ansible Basic understanding of Kubernetes or willingness to learn Strong problem-solving attitude and eagerness to learn Ability to communicate clearly and work in a technical team Communicative English, as you will work in an international environment Nice to have First experience with Kubernetes, Terraform, Ansible, or GitOps Familiarity with monitoring tools such as Prometheus, Grafana, Loki, or OpenTelemetry Basic understanding of DevSecOps practices and vulnerability management Familiarity with AI infrastructure, GPU environments, HPC, or data center 
operations Experience with GitHub Actions, GitLab CI, Jenkins, or similar tools Interest in platform engineering, SRE, or large-scale infrastructure What we offer Benefits package Opportunity to learn from experienced infrastructure, platform, cloud, and AI engineers Work on modern infrastructure supporting AI workloads Space for professional growth in DevOps and platform engineering Remote or hybrid work from Poland
Technology
xBerry Sp. z o.o.
DevOps Engineer
Senior
Remote
Wrocław, Poland
20,000 - 28,000 PLN/mo
🏢 Summary: DevOps Engineer role focused on maintaining and enhancing a complex, on-premise automation platform deployed globally on Linux and Kubernetes. The position involves advanced troubleshooting, incident response, and development of automation, monitoring, and self-healing mechanisms to reduce on-site interventions. Includes international travel and participation in an on-call rotation to ensure high system reliability. 🗂️ Requirements: Strong Linux (Ubuntu) administration and troubleshooting experience, Hands-on Kubernetes cluster management and troubleshooting, Practical Docker experience, Solid networking knowledge and network diagnostics skills, Experience with NFS and storage troubleshooting, Operational knowledge of GPU and CUDA environments, Experience with RabbitMQ, Experience with PostgreSQL, Ability to handle production incidents and system upgrades, Willingness to participate in on-call rotation, Readiness for international travel and on-site work 📃 Skills: Linux, Ubuntu, Kubernetes, Docker, Networking, NFS, CUDA, GPU, RabbitMQ, PostgreSQL 🏢 Description: Position Overview Important: Travel & On-Call Requirements This role requires readiness for long-distance international travel to customer sites . The systems are deployed globally and, when issues cannot be resolved remotely, on-site interventions may be necessary , including deployments, upgrades, and complex troubleshooting activities. Additionally, the position includes participation in a rotational on-call / standby schedule , ensuring operational continuity and the ability to respond to critical incidents outside of standard working hours. We are looking for an experienced DevOps Engineer to join a team responsible for the maintenance and further development of a complex automation system deployed on-premise at customer sites . The system is based on Linux (Ubuntu) and a containerized Kubernetes architecture . 
The platform consists of multiple cooperating application and infrastructure components, including: backend services GPU-based computing components (CUDA) communication layer storage networking components The environment is characterized by high operational complexity and strong dependencies between system layers (OS, Kubernetes, applications, networking, storage). Systems are deployed across multiple locations worldwide and often operate in environments with limited local IT support, which requires high reliability and well-defined operational procedures. Responsibilities Incident Handling and System Maintenance Diagnosing and resolving issues related to: Kubernetes clusters containers (Docker) Linux (Ubuntu) operating system networking storage (including NFS) Analyzing logs and service health across application and infrastructure layers Restoring full system functionality in production environments Performing system deployments and upgrades at customer sites Participating in on-site interventions when issues cannot be resolved remotely Automation, Observability, and System Resilience Designing and developing automated troubleshooting mechanisms Early detection of infrastructure and application-level issues Automated validation of the health of key system components: OS Kubernetes containers storage networking Building health checks and observability solutions (metrics, alerts, dashboards) Creating and maintaining: runbooks standard recovery procedures automated self-healing mechanisms Documenting common incidents, root causes, and resolution methods Technical Requirements Strong experience with Linux (Ubuntu) system administration and troubleshooting Hands-on experience with Kubernetes, including cluster troubleshooting and container analysis Practical knowledge of Docker Solid understanding of networking and diagnosing network-related issues Experience with NFS / storage troubleshooting Operational knowledge of GPU / CUDA environments (compatibility, stability) 
Experience working with: RabbitMQ PostgreSQL Additional Requirements Willingness to participate in an on-call / standby rotation Readiness for business travel, including on-site customer visits Ability to work independently in complex, distributed environments Strong analytical and problem-solving skills We offer Flexible working hours Remote work options Medical care program MultiSport Integration events A contract of employment or self-employment, depending on You
Technology
xBerry Sp. z o.o.
DevOps Engineer
Senior
Remote
Wroclaw, Poland
20,000 - 28,000 PLN/mo
🏢 Summary: DevOps Engineer role focused on maintaining and enhancing a complex, on-premise Kubernetes-based automation platform deployed globally. The position involves advanced troubleshooting across Linux, containers, networking, storage, and GPU layers, as well as building automation and observability to reduce on-site interventions. Includes international travel and participation in an on-call rotation to support production systems. 🗂️ Requirements: Strong Linux (Ubuntu) administration and troubleshooting experience, Hands-on Kubernetes cluster management and troubleshooting, Practical Docker experience, Solid networking diagnostics skills, Experience with NFS and storage troubleshooting, Operational knowledge of GPU/CUDA environments, Experience with RabbitMQ, Experience with PostgreSQL, Ability to handle production incidents across infrastructure and application layers, Willingness to participate in on-call rotation, Readiness for international travel and on-site support 📃 Skills: Linux, Ubuntu, Kubernetes, Docker, Networking, NFS, CUDA, GPU, RabbitMQ, PostgreSQL 🏢 Description: Position Overview Important: Travel & On-Call Requirements This role requires readiness for long-distance international travel to customer sites . The systems are deployed globally and, when issues cannot be resolved remotely, on-site interventions may be necessary , including deployments, upgrades, and complex troubleshooting activities. Additionally, the position includes participation in a rotational on-call / standby schedule , ensuring operational continuity and the ability to respond to critical incidents outside of standard working hours. We are looking for an experienced DevOps Engineer to join a team responsible for the maintenance and further development of a complex automation system deployed on-premise at customer sites . The system is based on Linux (Ubuntu) and a containerized Kubernetes architecture . 
The platform consists of multiple cooperating application and infrastructure components, including: backend services GPU-based computing components (CUDA) communication layer storage networking components The environment is characterized by high operational complexity and strong dependencies between system layers (OS, Kubernetes, applications, networking, storage). Systems are deployed across multiple locations worldwide and often operate in environments with limited local IT support, which requires high reliability and well-defined operational procedures. The DevOps role goes beyond reactive incident handling. A key objective of the project is to systematically reduce the need for on-site interventions by developing automated monitoring, diagnostics, and recovery mechanisms. Responsibilities Incident Handling and System Maintenance Diagnosing and resolving issues related to: Kubernetes clusters containers (Docker) Linux (Ubuntu) operating system networking storage (including NFS) Analyzing logs and service health across application and infrastructure layers Restoring full system functionality in production environments Performing system deployments and upgrades at customer sites Participating in on-site interventions when issues cannot be resolved remotely Automation, Observability, and System Resilience Designing and developing automated troubleshooting mechanisms Early detection of infrastructure and application-level issues Automated validation of the health of key system components: OS Kubernetes containers storage networking Building health checks and observability solutions (metrics, alerts, dashboards) Creating and maintaining: runbooks standard recovery procedures automated self-healing mechanisms Documenting common incidents, root causes, and resolution methods Collaboration and Architecture Improvement Close cooperation with development and architecture teams Contributing to architecture simplification and standardization Improving overall system 
stability and reliability Supporting long-term efforts to reduce operational overhead and manual interventions Technical Requirements Strong experience with Linux (Ubuntu) system administration and troubleshooting Hands-on experience with Kubernetes, including cluster troubleshooting and container analysis Practical knowledge of Docker Solid understanding of networking and diagnosing network-related issues Experience with NFS / storage troubleshooting Operational knowledge of GPU / CUDA environments (compatibility, stability) Experience working with: RabbitMQ PostgreSQL Additional Requirements Willingness to participate in an on-call / standby rotation Readiness for business travel, including on-site customer visits Ability to work independently in complex, distributed environments Strong analytical and problem-solving skills We offer Salary: 20–28k PLN B2B base + action fee Flexible working hours Remote work options Medical care program MultiSport Integration events A contract of employment or self-employment, depending on You
Technology
Grid Dynamics Poland
Senior C++ Developer
Senior
Hybrid
Krakow, Poland
🏢 Summary: The role involves owning and maintaining critical GPU libraries (cuDNN, NCCL) within large-scale infrastructure, ensuring seamless integration across multiple GPU generations. The engineer will focus on deep debugging, performance regression analysis, build system maintenance, and automation of benchmarking and library rollouts. This position centers on systems-level C++ development at the intersection of hardware and large-scale AI/ML workloads.
🗂️ Requirements:
Strong proficiency in C++ and ability to debug large-scale codebases
Deep understanding of Linux systems programming
Experience with compilation, dynamic and static linking
Knowledge of ELF binaries and shared libraries
Proficiency with Linux CLI
Advanced Bash scripting skills
Ability to triage complex technical issues and perform root-cause analysis
Clear written English (B2+) for technical documentation
📃 Skills: C++, Linux, Bash, CUDA, cuDNN, NCCL, Bazel, Blaze, CMake, ELF
🏢 Description:
We are looking for a Software Engineer (C++) to join a high-stakes team supporting the GPU infrastructure of a global Tier-1 Tech Giant. If you are a systems-level expert who thrives on the intersection of hardware and software, this is the role for you.
You will be the primary guardian of core GPU libraries (cuDNN, NCCL) that serve as the bedrock for large-scale Machine Learning and AI workloads. Your mission is to ensure that cutting-edge GPU architectures are seamlessly integrated and maintain extreme stability across massive hardware fleets, from legacy systems to the latest generations.
Responsibilities:
Library Integration: Own the end-to-end update lifecycle for critical GPU libraries, managing internal versioning, user migration, and alignment with open-source builds.
Triage & Deep Debugging: Act as a technical detective to identify, investigate, and debug performance regressions and functional bugs; create minimal reproducers for escalation to hardware vendors or internal core teams.
Build & Infrastructure Maintenance: Manage complex build/linking issues using large-scale build tools and maintain custom patches to ensure seamless integration.
Validation & Benchmarking: Rigorously verify new library versions against extensive benchmarking suites and coordinate with high-profile internal partners (e.g., Autonomous Driving units) to validate siloed codebases.
Automation & Evolution: Proactively improve regression benchmarking suites and develop automation tooling to reduce manual effort in library rollouts.
Technical Documentation: Maintain meticulous records of investigations and keep integration playbooks updated to ensure operational excellence.
Min requirements:
C++ Proficiency: Strong ability to read, navigate, and debug complex, large-scale C++ codebases.
Linux Systems Programming: Deep understanding of compilation, dynamic/static linking, ELF binaries, and shared library management.
Systems Environments: Proficiency with Linux CLI and advanced Bash scripting for infrastructure automation.
Analytical Depth: Proven track record of triaging ambiguous technical issues and conducting root-cause analysis (RCA).
Communication: Clear and concise written English (B2+), essential for documenting complex technical bug reports and coordinating with cross-functional global teams.
Would be a plus:
Build Systems Mastery: Familiarity with large-scale build tools like Bazel, Blaze, or CMake.
GPU Ecosystem: Prior exposure to CUDA, cuDNN, or NCCL architectures is highly advantageous.
Hardware Awareness: Experience managing software compatibility and performance across multiple hardware generations.
Tooling: Experience with internal data monitoring and bug-tracking systems in a distributed engineering environment.
We offer:
Opportunity to work on bleeding-edge projects
Work with a highly motivated and dedicated team
Competitive salary
Flexible schedule
Benefits package - medical insurance, sports
Corporate social events
Professional development opportunities
Well-equipped office
About us:
Grid Dynamics (NASDAQ: GDYN) is a leading provider of technology consulting, platform and product engineering, AI, and advanced analytics services. Fusing technical vision with business acumen, we solve the most pressing technical challenges and enable positive business outcomes for enterprise companies undergoing business transformation. A key differentiator for Grid Dynamics is our 8 years of experience and leadership in enterprise AI, supported by profound expertise and ongoing investment in data, analytics, cloud & DevOps, application modernization and customer experience. Founded in 2006, Grid Dynamics is headquartered in Silicon Valley with offices across the Americas, Europe, and India.