June 8, 2026

Principal Site Reliability Engineer

Senior • Remote

Warsaw, Poland

Are you ready to lead infrastructure strategy for a cutting‑edge AI‑driven SaaS platform? We are looking for a Principal Site Reliability Engineer with a proven track record in scaling, optimizing, and securing cloud‑based systems. This senior role offers the opportunity to shape the reliability and performance of a platform used by finance teams worldwide.

In this role, you will be part of a dynamic engineering environment where your expertise will directly influence product stability and growth. You will work with advanced cloud technologies, automation tools, and AI-driven solutions, contributing to projects that push the boundaries of innovation.

If you are ready to take on strategic responsibility and make a tangible impact, apply now and join us in building the future of reliable, scalable systems.

Customer

Sigma Software is partnering with a fast‑growing AI‑driven SaaS platform serving finance and accounting teams in high‑growth businesses. The platform automates critical workflows, from billing and collections to revenue recognition and reporting, ensuring compliance and accelerating cash flow. Leveraging advanced AI, it reduces manual work, increases operational efficiency, and supports scalability for customers worldwide.

Project

The project focuses on building and scaling an AI-powered SaaS solution for finance automation. It integrates advanced machine learning models with robust cloud infrastructure to deliver secure, compliant, and high‑performance services. The engineering culture emphasizes automation, resilience, and operational excellence.

Requirements

  • At least 8 years of experience in Site Reliability Engineering or DevOps roles, including 2+ years in a Principal or Lead position

  • Proven experience in infrastructure modernization and scaling initiatives for high‑growth environments

  • Strong proficiency in Python

  • Deep expertise in cloud platforms and container orchestration tools such as AWS ECS and EKS

  • Solid experience in CI/CD pipeline design and optimization using tools like GitHub Actions and Buildkite

  • Proficiency in infrastructure‑as‑code tools such as Terraform

  • Strong knowledge of monitoring, observability, and performance optimization practices

  • Upper-Intermediate level of spoken and written English

Would be a plus

  • Experience with monorepos (Turborepo, pnpm)

  • Familiarity with modern TypeScript tools (swc, biome, oxc)

  • Knowledge of NestJS, NextJS, and testing frameworks (Jest, Vitest)

Personal Profile

  • Excellent leadership, communication, and decision‑making abilities

  • Ability to work independently and make pragmatic build‑vs‑buy decisions in fast‑paced environments

Responsibilities

  • Define and lead infrastructure and reliability strategy across the platform

  • Design scalable, resilient systems in collaboration with engineering teams

  • Optimize build, testing, and deployment processes for speed and stability

  • Establish and uphold best practices for CI/CD, monitoring, and observability

  • Lead incident response and drive continuous improvement post‑incident

  • Automate workflows to reduce operational toil and risk

  • Mentor engineers and foster a culture of operational excellence

  • Make strategic build‑vs‑buy decisions balancing speed, quality, and sustainability

Similar jobs you might like

Technology

Sigma Software

Principal Site Reliability Engineer

Senior

Remote

Bucharest, Romania

🏢 Summary: Senior Principal Site Reliability Engineer role leading infrastructure strategy for an AI-driven SaaS platform in the finance domain. Responsible for scaling, securing, and optimizing cloud-based systems while driving reliability, automation, and operational excellence. The position shapes platform performance and resilience in a high-growth, cloud-native environment.

🗂️ Requirements: 8+ years in Site Reliability Engineering or DevOps, 2+ years in Principal or Lead role, Experience in infrastructure modernization and scaling, Strong Python proficiency, Expertise with AWS cloud platforms, Experience with container orchestration (ECS, EKS), Experience designing and optimizing CI/CD pipelines, Hands-on experience with Terraform, Strong knowledge of monitoring and observability practices, Experience leading incident response and reliability improvements

📃 Skills: Python, AWS, ECS, EKS, Kubernetes, Terraform, GitHubActions, Buildkite, CICD, Monitoring, Observability

Technology

Sigma Software

Principal Site Reliability Engineer

Senior

Remote

Warsaw, Poland

🏢 Summary: Principal Site Reliability Engineer role leading infrastructure strategy for an AI-driven SaaS platform in the finance domain. The position focuses on scaling, securing, and optimizing cloud-based systems while driving automation, reliability, and performance. You will shape CI/CD, observability, and infrastructure practices in a high-growth environment.

🗂️ Requirements: 8+ years in Site Reliability Engineering or DevOps, 2+ years in Principal or Lead role, Experience in infrastructure modernization and scaling, Strong proficiency in Python, Expertise in AWS cloud platforms, Experience with AWS ECS and EKS, Experience designing and optimizing CI/CD pipelines, Experience with Terraform for infrastructure-as-code, Strong knowledge of monitoring and observability practices

📃 Skills: Python, AWS, ECS, EKS, Terraform, GitHub, Buildkite, CICD, Monitoring, Observability

Technology

Link Group

Site Reliability Engineer

Senior

Remote

Warsaw, Poland

21,000 - 24,000 PLN

🏢 Summary: Senior Site Reliability Engineer responsible for end-to-end reliability of AI-driven applications and pipelines in production environments. Hands-on role focused on diagnosing, resolving, and automating production issues while improving monitoring and CI/CD processes. Ensures high performance, reliability, and standardized telemetry across AI systems.

🗂️ Requirements: 5+ years experience as SRE, Production Engineer, or Platform Engineer, Strong incident management and root cause analysis experience, Hands-on experience with Azure DevOps, Hands-on experience with Kubernetes, Hands-on experience with Datadog, Hands-on experience with Azure, Hands-on experience with CI/CD pipelines, Experience working in production environments, Ability to build and maintain monitoring and alerting systems

📃 Skills: Azure, Kubernetes, Datadog, AzureDevOps, CICD, Grafana, AI, LLM, Monitoring, Telemetry, RCA

🏢 Description:

About the Role

We are looking for a Senior Site Reliability Engineer who will take end-to-end ownership of reliability for AI-driven applications and pipelines. This is a hands-on engineering role, not a coordination or ticket-driven position. The ideal candidate actively diagnoses, resolves, and automates production issues rather than only designing solutions.

Requirements

  • 5+ years as SRE / Production / Platform Engineer

  • Strong incident management & RCA experience

  • Hands-on with: Azure DevOps, Kubernetes, Datadog, Azure, CI/CD

  • Proactive, ownership mindset, self-driven

  • Experience in production environments

  • Nice to have: AI/LLM pipelines, Grafana

Responsibilities

  • Build and maintain monitoring, alerting, dashboards

  • Lead incident response & root cause analysis

  • Ensure reliability and performance of AI pipelines

  • Standardize telemetry (latency, failures, throughput)

  • Optimize CI/CD and release quality

  • Reduce recurring incidents with engineering teams

Technology

Caspian One

Site Reliability Engineer

Senior

Hybrid

Krakow, Poland

1,400 - 1,800 PLN

🏢 Summary: Hands-on Site Reliability Engineer role focused on ensuring stability, scalability, and observability of a mission-critical distributed risk and analytics platform in hybrid cloud environments. The position centers on production reliability, incident response, automation, and continuous improvement of monitoring and deployment processes. You will collaborate with engineering teams to strengthen system resilience, performance, and operational standards.

🗂️ Requirements: Strong Java experience in distributed systems, Experience with observability and monitoring tools, Hands-on experience with hybrid cloud environments (preferably GCP), Experience with CI/CD pipelines and automation tools, Solid knowledge of Linux systems administration, Understanding of RDBMS fundamentals, Experience with job schedulers (e.g., Control-M), Ability to lead incident response and root-cause analysis

📃 Skills: Java, Grafana, Prometheus, Loki, OpenTelemetry, GCP, Jenkins, Ansible, Linux, SQL, Control-M, CI/CD

🏢 Description:

We’re looking for a seasoned Site Reliability Engineer to support a high‑performance, mission‑critical risk and analytics platform used across global trading and finance environments. You’ll play a key role in ensuring the stability, scalability, and observability of complex distributed systems running across hybrid cloud infrastructure.

In this role, you’ll take ownership of production reliability: driving incident response, conducting root‑cause analysis, improving monitoring capabilities, and delivering automation that reduces operational toil. You’ll work closely with development teams, platform engineers, and service management leads to strengthen resilience, refine processes, and enhance the engineering culture around availability and performance.

This is a hands-on technical position suited to someone who thrives in high‑throughput environments, communicates clearly, and enjoys solving deep engineering problems in real time.

Core Responsibilities

  • Maintain and improve the reliability, uptime, and performance of distributed applications.

  • Lead incident response, triage complex issues, coordinate recoveries, and deliver structured post‑incident reviews.

  • Enhance observability by designing and evolving monitoring, alerting, logging, and tracing frameworks.

  • Drive continuous improvement across automation, deployment processes, and service stability.

  • Collaborate with cross‑functional teams to influence architecture, design, and operational standards.

  • Support CI/CD pipelines, environment configuration, and vulnerability remediation.

  • Contribute to a knowledge‑driven culture through documentation, tooling, and best‑practice adoption.

Required Skills & Experience

  • Strong Java background with proven experience supporting or developing distributed systems.

  • Observability tooling expertise (Grafana, Prometheus, Loki, OpenTelemetry or similar).

  • Hands‑on with hybrid cloud environments, ideally with GCP or another major cloud provider.

  • CI/CD and automation experience (e.g., Jenkins, Ansible).

  • Solid understanding of Linux, RDBMS fundamentals, and job schedulers (e.g., Control‑M or equivalents).

  • Strong analytical mindset with a methodical approach to troubleshooting.

  • Excellent communication skills and comfort working in Agile teams.

Technology

Link Group

Senior Site Reliability Engineer

Senior

Hybrid

Warsaw, Poland

170 - 230 PLN

🏢 Summary: The role focuses on ensuring reliability, scalability, and performance of large-scale cloud-based applications by building and maintaining resilient infrastructure. You will manage AWS cloud environments, Kubernetes clusters, and CI/CD pipelines while implementing monitoring, automation, and incident response processes. The position emphasizes Infrastructure-as-Code, observability, and continuous reliability improvements.

🗂️ Requirements: 5+ years experience in SRE, DevOps or similar role, Strong experience with AWS cloud services, Experience with Infrastructure-as-Code tools, Hands-on experience with Kubernetes, Proficiency with Docker, Experience with CI/CD pipelines, Solid knowledge of PostgreSQL or Amazon RDS, Strong SQL knowledge, Knowledge of networking concepts (VPC, DNS, troubleshooting), Strong Linux/Unix administration skills, Experience with observability tools, Experience with automation in infrastructure, Experience with incident management

📃 Skills: AWS, Terraform, Pulumi, Kubernetes, EKS, Docker, GitHub, PostgreSQL, RDS, SQL, VPC, DNS, Linux, Unix, Prometheus, Grafana, Datadog, Dynatrace, CI/CD

🏢 Description:

We are looking for an experienced Site Reliability Engineer to ensure the reliability, scalability, and performance of large-scale cloud-based web applications. You will work closely with software development, cloud operations, and platform teams to build and maintain resilient infrastructure and improve system stability.

Key Responsibilities:

  • Design and maintain monitoring, alerting, and incident response systems to ensure high availability

  • Collaborate closely with engineering, product, and architecture teams

  • Build and manage cloud infrastructure using Infrastructure-as-Code (e.g., Terraform, Pulumi) on AWS

  • Operate and optimize Kubernetes environments (e.g., EKS)

  • Develop and maintain containerized applications using Docker

  • Improve CI/CD pipelines and drive automation across deployment processes

  • Implement and manage observability tools (logging, metrics, tracing)

  • Participate in incident management, postmortems, and reliability improvements

  • Support capacity planning, disaster recovery, and system scaling

  • Contribute to security, compliance, and operational best practices

  • Develop automation and AI-driven solutions for monitoring and incident prevention

Requirements:

  • 5+ years of experience in SRE, DevOps, or similar roles

  • Strong experience with AWS cloud services and Infrastructure-as-Code tools

  • Hands-on experience with Kubernetes and containerized environments

  • Proficiency in Docker and CI/CD pipelines (e.g., GitHub Actions)

  • Solid understanding of databases (e.g., PostgreSQL, Amazon RDS) and SQL

  • Knowledge of networking concepts (VPC, DNS, troubleshooting tools like dig/traceroute)

  • Strong Linux/Unix administration skills

  • Experience with observability tools (e.g., Prometheus, Grafana, Datadog, Dynatrace)

  • Familiarity with automation and AI-based solutions in infrastructure

  • Strong problem-solving and incident management skills

Technology

EPAM Systems

Senior Site Reliability Engineer (SRE)

Senior

Remote

🏢 Summary: The offer is for a Site Reliability Engineer responsible for ensuring high reliability, scalability, and performance of cloud-based systems. The role focuses on implementing SRE practices, automating infrastructure, managing incidents, and enhancing monitoring and CI/CD processes. You will collaborate with cross-functional teams to optimize operations and maintain service excellence.

🗂️ Requirements: Bachelor’s degree in Computer Science, Engineering, or related field, 3+ years of experience in Site Reliability Engineering or similar role, Experience with cloud platforms (AWS, GCP, or Azure), Hands-on experience with SRE practices (SLO, SLI, error budgets, postmortems, toil reduction, capacity planning, incident management), Proficiency in Python or other scripting/programming language, Experience with monitoring tools, Experience with CI/CD tools, Experience with infrastructure as code, Experience with configuration management, Knowledge of Kubernetes and Docker, English proficiency B2 or higher

📃 Skills: AWS, GCP, Azure, Python, Kubernetes, Docker, CI/CD, Terraform, Ansible, Monitoring, SLO, SLI, Git, Bash

🏢 Description:

We are seeking a highly skilled and motivated Site Reliability Engineer (SRE) to join our team. In this critical role, you will collaborate closely with software developers and operations teams to ensure high reliability, scalability, and efficiency of our systems, with a strong focus on meeting and exceeding customer expectations. Your expertise will be crucial in deploying, maintaining, and automating our infrastructure and application environments to ensure seamless user experiences. Your proactive involvement will be key to enhancing system reliability, optimizing resource utilization, and ensuring continuous improvement in our operational practices. Your responsibilities will include defining and tracking Service Level Objectives (SLOs), managing error budgets, and reducing toil through automation.

You will play a pivotal role in driving the success of technology initiatives, maximizing their impact across the organization, and ensuring that solutions consistently meet the high standards our customers expect.

Responsibilities

  • Collaborate with development, security, quality, and operation teams to implement SRE practices and ensure system reliability

  • Define and support required level of reliability, availability, and performance for services and applications

  • Design and deliver Cloud-based solutions tailored to client needs

  • Troubleshoot, mitigate, and support fixing of the infrastructure and application issues in a timely manner

  • Implement a monitoring system for the infrastructure and application reliability

  • Communicate technical concepts clearly to both engineering teams and management stakeholders

Requirements

  • Bachelor’s degree in Computer Science, Engineering, or a related field

  • 3+ years of hands-on experience in Site Reliability Engineering or related roles

  • Proven experience in any cloud (AWS/GCP/Azure)

  • Experience with implementing SRE practices such as SLO/SLI, Error budgets, Postmortems, Reducing Toil, capacity planning, and Incident Management

  • Python or other scripting/programming language

  • Strong background in monitoring tools

  • Proficiency in CI/CD tools, infrastructure as code, and configuration management

  • Solid knowledge of container orchestration technologies (Kubernetes, Docker)

  • English language proficiency at an Upper-Intermediate level (B2) or higher

Nice to have

  • Expertise in deployment and management of LLMs, including technologies like RAG

  • Certification in Kubernetes, AWS/GCP/Azure, or similar technologies

  • Proven experience in DevOps

  • Knowledge of managing and optimizing AI/ML models in production environments, including basic deployment, monitoring, and maintenance

We offer/Benefits

We gather like-minded people:

  • Engineering community of industry professionals

  • Friendly team and enjoyable working environment

  • Flexible schedule and opportunity to work remotely within Poland

  • Chance to work abroad for up to 60 days annually

  • Business-driven relocation opportunities

We provide growth opportunities:

  • Outstanding career roadmap

  • Leadership development, career advising, soft skills, and well-being programs

  • Certification (GCP, Azure, AWS)

  • Unlimited access to LinkedIn Learning, Get Abstract, Cloud Guru

  • English classes

We cover it all:

  • Stable income (Employment Contract or B2B)

  • Participation in the Employee Stock Purchase Plan

  • Benefits package (health insurance, multisport, shopping vouchers)

  • Strategically located offices featuring entertainment and relaxation zones, table tennis and football, free snacks, fantastic coffee, and more

  • Referral bonuses

  • Corporate, social and well-being events

Please note: The set of bonuses might vary based on the role you apply for; specifics will be discussed with our recruiter during the general interview. We will reach out to selected candidates exclusively.

EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.

Technology

emagine Polska

Senior DevOps / SRE (Platform Reliability Engineer) - French fluent

Senior

Remote

Lisbon, Portugal

🏢 Summary: Senior DevOps / SRE role focused on ensuring reliability, scalability, security, and performance of a cloud-native AWS platform. The position centers on infrastructure automation, CI/CD, Kubernetes operations, observability, and implementing SRE best practices to support highly available production systems. You will lead incident management, optimize cloud costs, and drive continuous improvement of platform resilience.

🗂️ Requirements: 5+ years in DevOps/SRE/Cloud/Platform Engineering, Strong Linux administration and troubleshooting, Production experience with Kubernetes, Experience with CI/CD tools, Expertise in Infrastructure as Code, Hands-on experience with AWS, Strong networking fundamentals, Experience with monitoring and logging tools, Scripting skills (Bash or Python)

📃 Skills: AWS, Kubernetes, Docker, Helm, Terraform, Ansible, CloudFormation, Linux, GitLab, Jenkins, GitHub, Azure, Prometheus, Grafana, ELK, Datadog, Splunk, Bash, Python, TCP/IP, DNS

🏢 Description:

We are looking for a Senior DevOps / Site Reliability Engineer (SRE) to ensure the reliability, scalability, performance, and security of our platform and cloud infrastructure. You will play a key role in building and operating cloud-native systems, improving observability, automating operations, implementing SRE best practices (SLOs/SLIs), and supporting development teams to deliver highly available services.

Key Responsibilities

  • Design, implement, and maintain highly available and scalable infrastructure on AWS.

  • Own and improve the reliability of production systems using SRE principles (SLO, SLI, error budgets).

  • Build and manage CI/CD pipelines to support fast and safe software delivery.

  • Develop and maintain Infrastructure as Code (IaC) using Terraform, Ansible, CloudFormation, etc.

  • Manage and optimize container orchestration platforms (Kubernetes, Docker, Helm).

  • Implement and maintain monitoring, logging, and alerting solutions (Prometheus, Grafana, ELK, Datadog, Splunk).

  • Lead incident response, perform root cause analysis, and write postmortems to drive continuous improvement.

  • Improve system performance, capacity planning, scaling strategies, and disaster recovery processes.

  • Collaborate closely with development teams to improve deployment strategies and system resilience.

  • Implement security best practices (IAM, secret management, vulnerability scanning, patching).

  • Define operational standards, runbooks, documentation, and best practices for platform reliability.

  • Participate in on-call rotation and provide senior-level support for critical production issues.

Key Responsibilities (5 Main Missions)

The DevOps / SRE lead will be responsible for the stability and evolution of the platform. Your role is structured around five main areas:

  • Mission 1: AWS Infrastructure Management (Build & Run)

  • Mission 2: CI/CD and Deployment Automation

  • Mission 3: Monitoring, Observability, and Alerting: Global Monitoring, Log Management, Application Monitoring, Business Analytics

  • Mission 4: Incident Management, Resilience, and Security

  • Mission 5: FinOps and AWS Cost Optimization

Key Requirements

  • 5+ years of experience in DevOps / SRE / Cloud Infrastructure / Platform Engineering.

  • Strong expertise in Linux systems administration and troubleshooting.

  • Proven experience with Kubernetes in production environments.

  • Strong experience with CI/CD tools (GitLab CI, Jenkins, GitHub Actions, Azure DevOps).

  • Solid knowledge of Infrastructure as Code (Terraform highly preferred).

  • Experience with AWS cloud platforms.

  • Strong understanding of networking fundamentals (TCP/IP, DNS, load balancing, reverse proxies).

  • Experience with observability tools: monitoring, metrics, logging, tracing.

  • Strong scripting skills (Bash, Python, or similar).

  • French advanced level.

Nice to Have

  • Experience with additional cloud platforms (Azure, GCP).

  • Strong understanding of networking fundamentals.

Technology

Relativity

Senior Engineer - Site Reliability Engineering

Senior

Remote

Krakow, Poland

208,000 - 312,000 PLN/yr

🏢 Summary: Senior Software Engineer – SRE role focused on building and maintaining highly available, observable, and resilient cloud-native systems. The position emphasizes automation, CI/CD improvements, incident management, and implementation of reliability best practices across a SaaS platform. You will collaborate cross-functionally to enhance scalability, performance, and operational excellence.

🗂️ Requirements:

  • 5+ years in Software Engineering, SRE, or Cloud Infrastructure

  • Experience with DevOps tools and practices

  • Proficiency in Python, Go, Java, or C#/.NET

  • Experience with at least two of: GitHub, Azure DevOps, GitLab, Jenkins

  • Hands-on experience with observability tools

  • Strong experience with CI/CD pipelines and automation

  • Experience with cloud-native distributed systems

  • Experience in high-availability SaaS environments

  • Knowledge of SLOs, SLIs, and error budgets

  • Experience with incident management and root cause analysis

  • Experience implementing redundancy and disaster recovery

  • Experience with Agile methodologies

📃 Skills: Python, Go, Java, C#, .NET, GitHub, Azure, GitLab, Jenkins, Prometheus, Grafana, OpenTelemetry, CI/CD, DevOps, SLO, SLI, SaaS, Automation, Cloud, DistributedSystems

🏢 Description:

Job Overview

As the Senior Software Engineer – SRE, you will focus on implementing and maintaining reliability solutions across the platform. This role emphasizes hands-on engineering work, automation, and operational excellence. The Senior Software Engineer will work closely with other engineers to ensure systems are highly available, observable, and resilient.

As a member of the engineering team, the Senior Software Engineer will work closely with Infrastructure, Engineering, and Product teams to develop highly resilient, observable, and automated solutions that enhance system availability and efficiency. The ideal candidate will bring deep technical expertise, strong problem-solving skills, and a passion for reliability engineering.

Job Description and Requirements

Job Responsibilities

  • Implement and advocate for best-in-class reliability, observability, and scalability practices across the platform.

  • Develop automated solutions for system reliability, capacity planning, and incident response to minimize manual intervention.

  • Participate in improving Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets to enhance system reliability.

  • Contribute to CI/CD pipeline improvements and DevOps practices.

  • Support root cause analysis (RCA) investigations, drive corrective actions, and advocate for a blameless postmortem culture.

  • Participate in on-call rotations to ensure 24/7 availability of critical systems.

  • Influence and mentor engineering teams on SRE principles, DevOps culture, and best practices.

  • Stay ahead of industry trends, adopting new tools, frameworks, and methodologies to continually improve system reliability.

Preferred Qualifications

  • 5+ years of experience in software engineering, site reliability engineering, or cloud infrastructure roles.

  • Experience with DevOps tooling and practices.

  • Proficiency in building service-oriented architectures and cloud-native distributed systems.

  • Proficiency in programming languages such as Python, Go, Java, or C#/.NET.

  • In-depth technical understanding of and experience with at least two of the following DevOps platforms: GitHub, Azure DevOps, GitLab, or Jenkins.

  • Hands-on experience with observability tools (e.g., Prometheus, Grafana, OpenTelemetry, or others).

  • Strong background in CI/CD pipelines, automation, and DevOps practices.

  • Experience working in global, high-availability SaaS environments.

  • Experience implementing redundancy and disaster recovery scenarios.

  • Excellent teamwork and cross-group collaboration skills.

  • Ability to collaborate with both technical and business professionals.

  • Hands-on experience with Agile project development methodologies.

  • Experience delivering complex technical solutions.

  • Excellent problem-solving, analytical, and communication skills.

Nice to have: Experience with Chaos Engineering and/or AIOps.

Competencies and Skills

  • Automation-First Mindset – Commitment to reducing toil through scripting and automation.

  • Reliability Engineering – Expertise in SLOs, SLIs, error budgets, and high-availability architectures.

  • Incident Management & Postmortems – Experience in handling production incidents and driving continuous improvement.

  • Observability & Monitoring – Deep understanding of logging, monitoring, and alerting best practices.

  • Practical knowledge of data structures and modern data engines.

  • Collaboration & Communication – Ability to work across teams, influence stakeholders, and advocate for reliability improvements.

  • Mentorship & Coaching – Passion for mentoring engineers and building an SRE culture within the organization.

Additional Information

This role offers a unique opportunity to shape the future of SRE in a cutting-edge SaaS company, ensuring the reliability and scalability of mission-critical applications for customers worldwide. If you are passionate about solving complex reliability challenges and driving technical excellence, we'd love to hear from you!

Relativity is a diverse workplace with different skills and life experiences, and we love and celebrate those differences. We believe that employees are happiest when they're empowered to be their full, authentic selves, regardless of how they identify.

Benefit Highlights:

  • Comprehensive health, dental, and vision plans

  • Parental leave for primary and secondary caregivers

  • Flexible work arrangements

  • Two week-long company breaks per year

  • Additional time off

  • Long-term incentive program

  • Training investment program

All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, protected veteran status, or any other legally protected basis, in accordance with applicable law.

Relativity is committed to competitive, fair, and equitable compensation practices. This position is eligible for total compensation, which includes a competitive base salary, an annual performance bonus, and long-term incentives. The expected salary range for this role is between 208,000 and 312,000 PLN. The final offered salary will be based on several factors, including but not limited to the candidate's depth of experience, skill set, qualifications, and internal pay equity. Hiring at the top end of the range would not be typical, to allow for future meaningful salary growth in this position.

Required Skills: Automation, Data Analysis, Database Management, Network Architecture, Performance Optimizations, Problem Solving, Project Management, Software Development, System Designs, Technical Leadership

Technology

Relativity

Senior Engineer - Site Reliability Engineering

Senior

Remote

Krakow, Poland

208,000 - 312,000 PLN/yr

🏢 Summary: Remote Senior Software Engineer – SRE role focused on building and maintaining highly available, scalable, and observable cloud-native systems. The position emphasizes automation, CI/CD improvements, incident management, and implementation of reliability best practices across SaaS platforms. The engineer collaborates cross-functionally to enhance system resilience, performance, and operational excellence.

🗂️ Requirements:

  • 5+ years in Software Engineering, SRE, or Cloud Infrastructure roles

  • Experience with DevOps tools and practices

  • Proficiency in Python, Go, Java, or C#/.NET

  • Experience with at least two of: GitHub, Azure DevOps, GitLab, Jenkins

  • Hands-on experience with observability tools

  • Strong experience with CI/CD pipelines and automation

  • Experience with cloud-native distributed systems

  • Experience in high-availability SaaS environments

  • Knowledge of SLOs, SLIs, and error budgets

  • Experience with redundancy and disaster recovery

  • Participation in on-call rotations

📃 Skills: Python, Go, Java, C#, .NET, GitHub, Azure, GitLab, Jenkins, Prometheus, Grafana, OpenTelemetry, CI/CD, DevOps, SLO, SLI, SaaS, Automation, Cloud, Agile

🏢 Description:

Posting Type: Remote

Job Overview

As the Senior Software Engineer – SRE, you will focus on implementing and maintaining reliability solutions across the platform. This role emphasizes hands-on engineering work, automation, and operational excellence. The Senior Software Engineer will work closely with other engineers to ensure systems are highly available, observable, and resilient.

As a member of the engineering team, the Senior Software Engineer will work closely with Infrastructure, Engineering, and Product teams to develop highly resilient, observable, and automated solutions that enhance system availability and efficiency. The ideal candidate will bring deep technical expertise, strong problem-solving skills, and a passion for reliability engineering.

Job Description and Requirements

Job Responsibilities

  • Implement and advocate for best-in-class reliability, observability, and scalability practices across the platform.

  • Develop automated solutions for system reliability, capacity planning, and incident response to minimize manual intervention.

  • Participate in improving Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets to enhance system reliability.

  • Contribute to CI/CD pipeline improvements and DevOps practices.

  • Support root cause analysis (RCA) investigations, drive corrective actions, and advocate for a blameless postmortem culture.

  • Participate in on-call rotations to ensure 24/7 availability of critical systems.

  • Influence and mentor engineering teams on SRE principles, DevOps culture, and best practices.

  • Stay ahead of industry trends, adopting new tools, frameworks, and methodologies to continually improve system reliability.

Preferred Qualifications

  • 5+ years of experience in software engineering, site reliability engineering, or cloud infrastructure roles.

  • Experience with DevOps tooling and practices.

  • Proficiency in building service-oriented architectures and cloud-native distributed systems.

  • Proficiency in programming languages such as Python, Go, Java, or C#/.NET.

  • In-depth technical understanding of and experience with at least two of the following DevOps platforms: GitHub, Azure DevOps, GitLab, or Jenkins.

  • Hands-on experience with observability tools (e.g., Prometheus, Grafana, OpenTelemetry, or others).

  • Strong background in CI/CD pipelines, automation, and DevOps practices.

  • Experience working in global, high-availability SaaS environments.

  • Experience implementing redundancy and disaster recovery scenarios.

  • Excellent teamwork and cross-group collaboration skills.

  • Ability to collaborate with both technical and business professionals.

  • Hands-on experience with Agile project development methodologies.

  • Experience delivering complex technical solutions.

  • Excellent problem-solving, analytical, and communication skills.

Nice to have: Experience with Chaos Engineering and/or AIOps.

Competencies and Skills

  • Automation-First Mindset – Commitment to reducing toil through scripting and automation.

  • Reliability Engineering – Expertise in SLOs, SLIs, error budgets, and high-availability architectures.

  • Incident Management & Postmortems – Experience in handling production incidents and driving continuous improvement.

  • Observability & Monitoring – Deep understanding of logging, monitoring, and alerting best practices.

  • Practical knowledge of data structures and modern data engines.

  • Collaboration & Communication – Ability to work across teams, influence stakeholders, and advocate for reliability improvements.

  • Mentorship & Coaching – Passion for mentoring engineers and building an SRE culture within the organization.

Additional Information

This role offers a unique opportunity to shape the future of SRE in a cutting-edge SaaS company, ensuring the reliability and scalability of mission-critical applications for customers worldwide. If you are passionate about solving complex reliability challenges and driving technical excellence, we'd love to hear from you!

Relativity is a diverse workplace with different skills and life experiences, and we love and celebrate those differences. We believe that employees are happiest when they're empowered to be their full, authentic selves, regardless of how they identify.

Benefit Highlights:

  • Comprehensive health, dental, and vision plans

  • Parental leave for primary and secondary caregivers

  • Flexible work arrangements

  • Two week-long company breaks per year

  • Additional time off

  • Long-term incentive program

  • Training investment program

All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, protected veteran status, or any other legally protected basis, in accordance with applicable law.

Relativity is committed to competitive, fair, and equitable compensation practices. This position is eligible for total compensation, which includes a competitive base salary, an annual performance bonus, and long-term incentives. The expected salary range for this role is between 208,000 and 312,000 PLN. The final offered salary will be based on several factors, including but not limited to the candidate's depth of experience, skill set, qualifications, and internal pay equity. Hiring at the top end of the range would not be typical, to allow for future meaningful salary growth in this position.

Required Skills: Automation, Data Analysis, Database Management, Network Architecture, Performance Optimizations, Problem Solving, Project Management, Software Development, System Designs, Technical Leadership

Technology

Grid Dynamics Poland

Senior Site Reliability Engineer (SRE)

Senior

Hybrid

Warsaw, Poland

100 - 128 PLN

🏢 Summary: Senior Site Reliability Engineer role focused on ensuring reliability, performance, and resilience of enterprise products by bridging infrastructure and software engineering. The position involves hands-on Java/Spring Boot code fixes, Kubernetes-based container operations, incident response, and proactive architecture improvements. The engineer drives automation, observability, and security best practices across the SDLC.

🗂️ Requirements:

  • 5+ years of experience in SRE or Platform Engineering

  • Strong proficiency in Java

  • Strong proficiency in Spring Boot

  • Experience with Hibernate

  • Experience with Jenkins

  • Ability to read, analyze, and fix application code

  • Hands-on experience with Docker

  • Hands-on experience with Kubernetes

  • Deep knowledge of Linux systems

  • Strong understanding of networking

  • Experience with distributed systems

  • Experience with monitoring and observability tools

  • Bachelor's degree in Computer Science, Systems Engineering, or equivalent experience

📃 Skills: Java, Spring, Hibernate, Jenkins, Docker, Kubernetes, Linux, Networking, Prometheus, Grafana, Splunk

🏢 Description:

We are looking for an experienced Senior Site Reliability Engineer to join our team and oversee the reliability, resilience, and performance of our core enterprise products. In this role, you will bridge the gap between infrastructure operations and software engineering. You won't just react to alerts: you will proactively analyze system architecture, build automation, and dive deep into the application code (Java/Spring Boot) to fix bugs and eliminate issues at their root.

Responsibilities:

  • Architecture & Reliability: Understand the end-to-end product topology from both infrastructure and application perspectives. Identify bottlenecks, scale limitations, and unstable components, driving long-term resolutions before they impact production.

  • Incident Response & RCA: Respond to outages, provide L3 on-call technical support (on rotation), and perform blameless Root Cause Analysis (RCA) to implement permanent fixes.

  • Hands-on Engineering: Address defects, perform code bug fixes directly in production, and recommend architectural improvements during incident analysis.

  • Security & Vulnerability Management: Oversee vulnerability management for applications and containers, manage patching processes, ensure compliance, and monitor certificate expirations and renewals according to global best practices.

  • SRE Advocacy & SDLC: Represent the SRE organization in design reviews, capacity planning, and operational readiness exercises. Partner closely with development teams to embed reliability best practices early in the SDLC.

  • Automation & Mentoring: Build automation tools to reduce manual toil and improve efficiency. Spread SRE culture, create standard documentation, and provide technical mentorship to junior team members.

  • System Health: Oversee the production environment by tracking availability, applying learnings from observability tools, and becoming a Subject Matter Expert (SME) on core issuing products.

Minimum requirements:

  • Experience: 5+ years of experience in Site Reliability Engineering (SRE) or Platform Engineering roles.

  • Software Engineering: Strong proficiency in Java, Spring Boot, Hibernate, and Jenkins. Ability to read, analyze, and fix application code.

  • Containerization: Hands-on expertise with Docker and container orchestration using Kubernetes.

  • Infrastructure: Deep knowledge of Linux systems, networking, and distributed architectures.

  • Observability: Strong understanding of monitoring, logging, and observability tools (e.g., Prometheus, Grafana, Splunk).

  • Education: Bachelor's degree in Computer Science, Systems Engineering, or equivalent practical experience.

  • Soft Skills: Excellent problem-solving abilities and strong communication skills.

Would be a plus:

  • Infrastructure as Code & Cloud: Hands-on experience with tools like Terraform or Ansible, alongside familiarity with major public cloud providers (AWS, GCP, or Azure).

  • Advanced Networking & Service Mesh: Knowledge of service mesh technologies (e.g., Istio, Linkerd) for traffic management, security, and observability in microservices architectures.

  • Industry Experience: Previous background in the FinTech, payments, or banking sectors, with an understanding of high-security compliance standards (e.g., PCI-DSS).

We offer:

  • Opportunity to work on bleeding-edge projects

  • Work with a highly motivated and dedicated team

  • Competitive salary

  • Flexible schedule

  • Benefits package: medical insurance, sports

  • Corporate social events

  • Professional development opportunities

  • Well-equipped office

About us:

Grid Dynamics (NASDAQ: GDYN) is a leading provider of technology consulting, platform and product engineering, AI, and advanced analytics services. Fusing technical vision with business acumen, we solve the most pressing technical challenges and enable positive business outcomes for enterprise companies undergoing business transformation. A key differentiator for Grid Dynamics is our 8 years of experience and leadership in enterprise AI, supported by profound expertise and ongoing investment in data, analytics, cloud & DevOps, application modernization, and customer experience. Founded in 2006, Grid Dynamics is headquartered in Silicon Valley with offices across the Americas, Europe, and India.