June 8, 2026

Senior Site Reliability Engineer

Senior • Hybrid

170 - 230 PLN

Warsaw, Poland

We are looking for an experienced Site Reliability Engineer to ensure the reliability, scalability, and performance of large-scale cloud-based web applications. You will work closely with software development, cloud operations, and platform teams to build and maintain resilient infrastructure and improve system stability.

Key Responsibilities:

Design and maintain monitoring, alerting, and incident response systems to ensure high availability
Collaborate closely with engineering, product, and architecture teams
Build and manage cloud infrastructure using Infrastructure-as-Code (e.g., Terraform, Pulumi) on AWS
Operate and optimize Kubernetes environments (e.g., EKS)
Develop and maintain containerized applications using Docker
Improve CI/CD pipelines and drive automation across deployment processes
Implement and manage observability tools (logging, metrics, tracing)
Participate in incident management, postmortems, and reliability improvements
Support capacity planning, disaster recovery, and system scaling
Contribute to security, compliance, and operational best practices
Develop automation and AI-driven solutions for monitoring and incident prevention

Requirements:

5+ years of experience in SRE, DevOps, or similar roles
Strong experience with AWS cloud services and Infrastructure-as-Code tools
Hands-on experience with Kubernetes and containerized environments
Proficiency in Docker and CI/CD pipelines (e.g., GitHub Actions)
Solid understanding of databases (e.g., PostgreSQL, Amazon RDS) and SQL
Knowledge of networking concepts (VPC, DNS, troubleshooting tools like dig/traceroute)
Strong Linux/Unix administration skills
Experience with observability tools (e.g., Prometheus, Grafana, Datadog, Dynatrace)
Familiarity with automation and AI-based solutions in infrastructure
Strong problem-solving and incident management skills

Similar jobs you might like

Technology

Caspian One

Site Reliability Engineer

Senior

Hybrid

Krakow, Poland

1,400 - 1,800 PLN

🏢 Summary: Hands-on Site Reliability Engineer role focused on ensuring stability, scalability, and observability of a mission-critical distributed risk and analytics platform in hybrid cloud environments. The position centers on production reliability, incident response, automation, and continuous improvement of monitoring and deployment processes. You will collaborate with engineering teams to strengthen system resilience, performance, and operational standards. 🗂️ Requirements: Strong Java experience in distributed systems, Experience with observability and monitoring tools, Hands-on experience with hybrid cloud environments (preferably GCP), Experience with CI/CD pipelines and automation tools, Solid knowledge of Linux systems administration, Understanding of RDBMS fundamentals, Experience with job schedulers (e.g., Control-M), Ability to lead incident response and root-cause analysis 📃 Skills: Java, Grafana, Prometheus, Loki, OpenTelemetry, GCP, Jenkins, Ansible, Linux, SQL, Control-M, CI/CD 🏢 Description: We’re looking for a seasoned Site Reliability Engineer to support a high‑performance, mission‑critical risk and analytics platform used across global trading and finance environments. You’ll play a key role in ensuring the stability, scalability, and observability of complex distributed systems running across hybrid cloud infrastructure. In this role, you’ll take ownership of production reliability, driving incident response, conducting root‑cause analysis, improving monitoring capabilities, and delivering automation that reduces operational toil. You’ll work closely with development teams, platform engineers, and service management leads to strengthen resilience, refine processes, and enhance the engineering culture around availability and performance. This is a hands-on technical position suited to someone who thrives in high‑throughput environments, communicates clearly, and enjoys solving deep engineering problems in real time.
Core Responsibilities: Maintain and improve the reliability, uptime, and performance of distributed applications. Lead incident response, triage complex issues, coordinate recoveries, and deliver structured post‑incident reviews. Enhance observability by designing and evolving monitoring, alerting, logging, and tracing frameworks. Drive continuous improvement across automation, deployment processes, and service stability. Collaborate with cross‑functional teams to influence architecture, design, and operational standards. Support CI/CD pipelines, environment configuration, and vulnerability remediation. Contribute to a knowledge‑driven culture through documentation, tooling, and best‑practice adoption. Required Skills & Experience: Strong Java background with proven experience supporting or developing distributed systems. Observability tooling expertise (Grafana, Prometheus, Loki, OpenTelemetry or similar). Hands‑on with hybrid cloud environments, ideally with GCP or another major cloud provider. CI/CD and automation experience (e.g., Jenkins, Ansible). Solid understanding of Linux, RDBMS fundamentals, and job schedulers (e.g., Control‑M or equivalents). Strong analytical mindset with a methodical approach to troubleshooting. Excellent communication skills and comfort working in Agile teams.

Technology

Link Group

Site Reliability Engineer

Mid

Hybrid

Warsaw, Poland

🏢 Summary: Hands-on Site Reliability Engineer role focused on building and scaling reliability practices across cloud and on-prem environments. The position involves improving performance, scalability, and resilience of production systems through automation, observability, and Kubernetes-based infrastructure. You will drive SRE standards and collaborate with engineering teams to enhance system stability and fault tolerance. 🗂️ Requirements: 4+ years experience in SRE, DevOps or similar roles, Strong experience with distributed systems, Strong experience with Kubernetes, Experience with AWS cloud, Hands-on automation experience with Python, Bash or Go, Solid understanding of CI/CD practices, Experience with observability and monitoring tools, Experience managing production systems 📃 Skills: Kubernetes, AWS, Python, Bash, Go, Prometheus, Grafana, CI/CD, SRE, DevOps 🏢 Description: We’re looking for a Site Reliability Engineer (SRE) to help build and scale reliability practices across our engineering organization. This is a hands-on role where you’ll work across cloud and on-prem environments, improving the performance, scalability, and resilience of critical production systems. 🔧 What you’ll be doing: • Driving SRE best practices, standards, and ways of working • Building and scaling observability & monitoring solutions (e.g. Prometheus, Grafana) • Working with Kubernetes-based infrastructure to ensure reliability and efficiency • Automating deployments, incident response, and recovery processes • Collaborating closely with engineering teams to improve system stability and fault tolerance • Contributing to a strong reliability culture (SLOs, post-mortems, continuous improvement) ✅ What we’re looking for: • 4+ years of experience in SRE / DevOps / similar roles • Strong experience with distributed systems, Kubernetes, and cloud (AWS preferred) • Hands-on approach to automation (Python, Bash, or Go) • Solid understanding of CI/CD and modern software delivery • Proactive mindset and strong ownership of production systems

Technology

emagine Polska

Senior DevOps / SRE (Platform Reliability Engineer) - French fluent

Senior

Remote

Lisbon, Portugal

🏢 Summary: Senior DevOps / SRE role focused on ensuring reliability, scalability, security, and performance of a cloud-native AWS platform. The position centers on infrastructure automation, CI/CD, Kubernetes operations, observability, and implementing SRE best practices to support highly available production systems. You will lead incident management, optimize cloud costs, and drive continuous improvement of platform resilience. 🗂️ Requirements: 5+ years in DevOps/SRE/Cloud/Platform Engineering, Strong Linux administration and troubleshooting, Production experience with Kubernetes, Experience with CI/CD tools, Expertise in Infrastructure as Code, Hands-on experience with AWS, Strong networking fundamentals, Experience with monitoring and logging tools, Scripting skills (Bash or Python) 📃 Skills: AWS, Kubernetes, Docker, Helm, Terraform, Ansible, CloudFormation, Linux, GitLab, Jenkins, GitHub, Azure, Prometheus, Grafana, ELK, Datadog, Splunk, Bash, Python, TCP/IP, DNS 🏢 Description: We are looking for a Senior DevOps / Site Reliability Engineer (SRE) to ensure the reliability, scalability, performance, and security of our platform and cloud infrastructure. You will play a key role in building and operating cloud-native systems, improving observability, automating operations, implementing SRE best practices (SLOs/SLIs), and supporting development teams to deliver highly available services. Key Responsibilities Design, implement, and maintain highly available and scalable infrastructure on AWS. Own and improve the reliability of production systems using SRE principles (SLO, SLI, error budgets). Build and manage CI/CD pipelines to support fast and safe software delivery. Develop and maintain Infrastructure as Code (IaC) using Terraform, Ansible, CloudFormation, etc. Manage and optimize container orchestration platforms (Kubernetes, Docker, Helm). Implement and maintain monitoring, logging, and alerting solutions (Prometheus, Grafana, ELK, Datadog, Splunk). 
Lead incident response, perform root cause analysis, and write postmortems to drive continuous improvement. Improve system performance, capacity planning, scaling strategies, and disaster recovery processes. Collaborate closely with development teams to improve deployment strategies and system resilience. Implement security best practices (IAM, secret management, vulnerability scanning, patching). Define operational standards, runbooks, documentation, and best practices for platform reliability. Participate in on-call rotation and provide senior-level support for critical production issues. Key Responsibilities (5 Main Missions) The DevOps / SRE lead will be responsible for the stability and evolution of the platform. Your role is structured around five main areas: Mission 1: AWS Infrastructure Management (Build & Run) Mission 2: CI/CD and Deployment Automation Mission 3: Monitoring, Observability, and Alerting: Global Monitoring, Log Management, Application Monitoring, Business Analytics Mission 4: Incident Management, Resilience, and Security Mission 5: FinOps and AWS Cost Optimization Key Requirements 5+ years of experience in DevOps / SRE / Cloud Infrastructure / Platform Engineering. Strong expertise in Linux systems administration and troubleshooting. Proven experience with Kubernetes in production environments. Strong experience with CI/CD tools (GitLab CI, Jenkins, GitHub Actions, Azure DevOps). Solid knowledge of Infrastructure as Code (Terraform highly preferred). Experience with AWS cloud platforms. Strong understanding of networking fundamentals (TCP/IP, DNS, load balancing, reverse proxies). Experience with observability tools: monitoring, metrics, logging, tracing. Strong scripting skills (Bash, Python, or similar). Advanced level of French. Nice to Have Experience with additional cloud platforms (Azure, GCP). Strong understanding of networking fundamentals.

Technology

Link Group

DevOps / Site Reliability Engineer

Mid

Hybrid

Kraków, Poland

20,000 - 25,000 PLN

🏢 Summary: DevOps / Site Reliability Engineer role focused on building and maintaining scalable cloud infrastructure while improving platform reliability and automation. The position centers on Kubernetes-based environments, CI/CD pipeline development, and enhancing monitoring and observability. The engineer will support development teams through infrastructure as code and internal developer platform initiatives. 🗂️ Requirements: Experience with cloud platforms (Azure preferred), Strong experience with Kubernetes, Strong knowledge of Infrastructure as Code (Terraform), Hands-on experience with CI/CD tools, Experience with monitoring and observability tools, Understanding of scalability, reliability, and security best practices 📃 Skills: Azure, Kubernetes, Terraform, GitHubActions, ArgoCD, CI/CD, Datadog, Prometheus, Grafana, MongoDB, Rancher, Jenkins, PowerBI, Jira, Confluence 🏢 Description: DevOps / Site Reliability Engineer We’re looking for a DevOps / SRE to help build and maintain scalable cloud infrastructure and improve reliability across our platform. You’ll focus on automation, CI/CD, and supporting development teams with efficient tooling and processes. Key responsibilities Develop and manage cloud infrastructure (Azure preferred) Work with Kubernetes and containerized environments Build and maintain CI/CD pipelines (GitHub Actions, ArgoCD) Automate deployments and operational processes Contribute to Internal Developer Platform (IDP) development Improve monitoring and observability (e.g., Datadog, Prometheus, Grafana) Requirements Experience with cloud platforms and Kubernetes Strong knowledge of Infrastructure as Code (e.g., Terraform) Hands-on experience with CI/CD tools Understanding of scalability, reliability, and security best practices Experience with monitoring/observability tools Nice to have Experience with MongoDB Atlas, Rancher, Jenkins, Power BI Familiarity with Jira, Confluence

Technology

Link Group

Site Reliability Engineer

Senior

Remote

Warsaw, Poland

21,000 - 24,000 PLN

🏢 Summary: Senior Site Reliability Engineer responsible for end-to-end reliability of AI-driven applications and pipelines in production environments. Hands-on role focused on diagnosing, resolving, and automating production issues while improving monitoring and CI/CD processes. Ensures high performance, reliability, and standardized telemetry across AI systems. 🗂️ Requirements: 5+ years experience as SRE, Production Engineer, or Platform Engineer, Strong incident management and root cause analysis experience, Hands-on experience with Azure DevOps, Hands-on experience with Kubernetes, Hands-on experience with Datadog, Hands-on experience with Azure, Hands-on experience with CI/CD pipelines, Experience working in production environments, Ability to build and maintain monitoring and alerting systems 📃 Skills: Azure, Kubernetes, Datadog, AzureDevOps, CICD, Grafana, AI, LLM, Monitoring, Telemetry, RCA 🏢 Description: About the Role We are looking for a Senior Site Reliability Engineer who will take end-to-end ownership of reliability for AI-driven applications and pipelines. This is a hands-on engineering role, not a coordination or ticket-driven position. The ideal candidate actively diagnoses, resolves, and automates production issues rather than only designing solutions. Requirements 5+ years as SRE / Production / Platform Engineer Strong incident management & RCA experience Hands-on with: Azure DevOps, Kubernetes, Datadog, Azure, CI/CD Proactive, ownership mindset, self-driven Experience in production environments Nice to have: AI/LLM pipelines, Grafana Responsibilities Build and maintain monitoring, alerting, dashboards Lead incident response & root cause analysis Ensure reliability and performance of AI pipelines Standardize telemetry (latency, failures, throughput) Optimize CI/CD and release quality Reduce recurring incidents with engineering teams

Technology

EPAM Systems

Senior Site Reliability Engineer (SRE)

Senior

Remote

🏢 Summary: The offer is for a Site Reliability Engineer responsible for ensuring high reliability, scalability, and performance of cloud-based systems. The role focuses on implementing SRE practices, automating infrastructure, managing incidents, and enhancing monitoring and CI/CD processes. You will collaborate with cross-functional teams to optimize operations and maintain service excellence. 🗂️ Requirements: Bachelor’s degree in Computer Science, Engineering, or related field, 3+ years of experience in Site Reliability Engineering or similar role, Experience with cloud platforms (AWS, GCP, or Azure), Hands-on experience with SRE practices (SLO, SLI, error budgets, postmortems, toil reduction, capacity planning, incident management), Proficiency in Python or other scripting/programming language, Experience with monitoring tools, Experience with CI/CD tools, Experience with infrastructure as code, Experience with configuration management, Knowledge of Kubernetes and Docker, English proficiency B2 or higher 📃 Skills: AWS, GCP, Azure, Python, Kubernetes, Docker, CI/CD, Terraform, Ansible, Monitoring, SLO, SLI, Git, Bash 🏢 Description: We are seeking a highly skilled and motivated Site Reliability Engineer (SRE) to join our team. In this critical role, you will collaborate closely with software developers and operations teams to ensure high reliability, scalability, and efficiency of our systems, with a strong focus on meeting and exceeding customer expectations. Your expertise will be crucial in deploying, maintaining, and automating our infrastructure and application environments to ensure seamless user experiences. Your proactive involvement will be key to enhancing system reliability, optimizing resource utilization, and ensuring continuous improvement in our operational practices. Your responsibilities will include defining and tracking Service Level Objectives (SLOs), managing error budgets, and reducing toil through automation. 
You will play a pivotal role in driving the success of technology initiatives, maximizing their impact across the organization, and ensuring that solutions consistently meet the high standards our customers expect. Responsibilities Collaborate with development, security, quality, and operation teams to implement SRE practices and ensure system reliability Define and support required level of reliability, availability, and performance for services and applications Design and deliver Cloud-based solutions tailored to client needs Troubleshoot, mitigate, and support fixing of the infrastructure and application issues in a timely manner Implement a monitoring system for the infrastructure and application reliability Communicate technical concepts clearly to both engineering teams and management stakeholders Requirements Bachelor’s degree in Computer Science, Engineering, or a related field 3+ years of hands-on experience in Site Reliability Engineering or related roles Proven experience in any cloud (AWS/GCP/Azure) Experience with implementing SRE practices such as SLO/SLI, Error budgets, Postmortems, Reducing Toil, capacity planning, and Incident Management Python or other scripting/programming language Strong background in monitoring tools Proficiency in CI/CD tools, infrastructure as code, and configuration management Solid knowledge of container orchestration technologies (Kubernetes, Docker) English language proficiency at an Upper-Intermediate level (B2) or higher Nice to have Expertise in deployment and management of LLMs, including technologies like RAG Certification in Kubernetes, AWS/GCP/Azure, or similar technologies Proven experience in DevOps Knowledge of managing and optimizing AI/ML models in production environments, including basic deployment, monitoring, and maintenance We offer/Benefits We gather like-minded people: Engineering community of industry professionals Friendly team and enjoyable working environment Flexible schedule and opportunity to work 
remotely within Poland Chance to work abroad for up to 60 days annually Business-driven relocation opportunities We provide growth opportunities: Outstanding career roadmap Leadership development, career advising, soft skills, and well-being programs Certification (GCP, Azure, AWS) Unlimited access to LinkedIn Learning, Get Abstract, Cloud Guru English classes We cover it all: Stable income (Employment Contract or B2B) Participation in the Employee Stock Purchase Plan Benefits package (health insurance, multisport, shopping vouchers) Strategically located offices featuring entertainment and relaxation zones, table tennis and football, free snacks, fantastic coffee, and more Referral bonuses Corporate, social and well-being events Please, note: The set of bonuses might vary based on the role you apply for – specifics will be discussed with our recruiter during the general interview. We will reach out to selected candidates exclusively. EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.

Technology

Grid Dynamics Poland

Site Reliability Engineer

Senior

Hybrid

Warsaw, Poland

🏢 Summary: Site Reliability Engineer role focused on leading the cloud platform layer of a large-scale enterprise migration to GCP, with full ownership of observability and FinOps capabilities. The position involves architecting cost attribution, distributed tracing, monitoring, and performance engineering solutions in a production-grade Kubernetes environment. You will work on complex distributed systems, extending multi-language codebases and managing infrastructure as code in a regulated enterprise setting. 🗂️ Requirements: 4–6 years software or DevOps engineering experience, 2–3 years hands-on cloud infrastructure management in production, Strong GCP expertise including GKE and Cloud Run, Proven experience building observability solutions with OpenTelemetry, Experience with distributed tracing and profiling in distributed systems, Advanced Python scripting for automation and tooling, Strong Terraform proficiency with multi-environment setups, Ability to read and modify Kotlin and Java codebases, Experience implementing monitoring, alerting, and SLOs for containerized/serverless services, Experience with infrastructure cost attribution and cloud billing APIs 📃 Skills: GCP, GKE, CloudRun, Kubernetes, OpenTelemetry, Terraform, Python, Kotlin, Java, FinOps, PubSub, Bigtable, Docker, SLO, Tracing 🏢 Description: We are looking for a Site Reliability Engineer to join a high-stakes global tech ecosystem and drive the delivery of a critical enterprise platform migration to the cloud. Your core mission will be to architect, build, and productionalize the observability and cost intelligence (FinOps) layer for a massive, multi-year financial platform transformation. You will take end-to-end ownership of the cloud platform layer, giving internal stakeholders full visibility into platform behavior, performance, and infrastructure spend. 
Working alongside a nearshore team of senior engineers, you will solve highly complex architectural challenges in a production-grade, distributed system. Responsibilities: End-to-End Infrastructure & FinOps Ownership: Architect and implement a cloud usage and cost attribution dashboard, providing detailed per-pod and per-service cost breakdown using cloud billing APIs and internal FinOps hubs. Advanced Observability & Tracing: Instrument end-to-end distributed tracing using OpenTelemetry, configuring collectors within Kubernetes environments and exporting traces to cloud monitoring systems utilizing RED metrics. Performance Engineering & Stress Testing: Write custom tooling from scratch to deliver database performance monitoring, load testing, and trend analysis for critical underlying storage layers. Monitoring & Alerting Automation: Build and deploy scalable production monitoring, custom alerting policies, and SLO tracking for containerized and serverless services. Infrastructure as Code: Independently manage, write, and apply infrastructure modifications using Terraform, working within established enterprise repository standards, modules, and environment state management. Cross-Language Codebase Extension: Read, debug, and extend existing platform code across a diverse stack including Kotlin, Java, and Python to seamlessly integrate technical metrics without disrupting business logic. Quality & Release Assurance: Implement rigorous unit testing with high code coverage for all newly developed monitoring tools to comply with strict enterprise quality gates and sign-offs. Min requirements: Experience: 4 to 6 years of professional software or DevOps engineering experience, with at least 2 to 3 years of hands-on cloud infrastructure management in production. Advanced Cloud Infrastructure: Deep operational proficiency with Google Cloud Platform (GCP), specifically with managing and configuring workload-level alerting on Google Kubernetes Engine (GKE) and Cloud Run. 
Observability & OpenTelemetry: Proven track record of building observability solutions in distributed systems, using OpenTelemetry (both auto and manual instrumentation) alongside distributed tracing and profiling tools. Strong Automation Scripting: Intermediate-to-advanced fluency in Python for writing custom test tooling, metrics integration scripts, and backend automation from scratch. Solid Infrastructure as Code: Strong proficiency in Terraform, including experience with multi-environment setups, workspaces, and corporate module standards. Polyglot & JVM Familiarity: Practical ability to read, understand, and modify existing backend codebases written in Kotlin and Java. Crucial Non-Technical Skills: Extreme technical autonomy to resolve blockers independently, rapid onboarding skills into large unfamiliar codebases, and fluent written English for async alignment and pull requests. Process Alignment: Ability to thrive in a highly regulated enterprise environment with strict peer reviews, robust documentation requirements, and formal deployment procedures. Would be a plus: Domain Knowledge: Previous experience working within financial services, fintech, investment banking, or other highly regulated industries. Enterprise Streaming Tools: Working knowledge of cloud messaging systems (such as Cloud Pub/Sub) utilized for inter-service communication. Advanced Storage Engines: Familiarity with high-throughput distributed database architectures, such as Google Cloud Bigtable. Systems Languages Awareness: Ability to read or debug foundational code written in low-level systems languages like Rust or C++ during multi-stack production deployments. 
We offer: Opportunity to work on bleeding-edge projects Work with a highly motivated and dedicated team Competitive salary Flexible schedule Benefits package - medical insurance, sports Corporate social events Professional development opportunities Well-equipped office About us: Grid Dynamics (NASDAQ: GDYN) is a leading provider of technology consulting, platform and product engineering, AI, and advanced analytics services. Fusing technical vision with business acumen, we solve the most pressing technical challenges and enable positive business outcomes for enterprise companies undergoing business transformation. A key differentiator for Grid Dynamics is our 8 years of experience and leadership in enterprise AI, supported by profound expertise and ongoing investment in data, analytics, cloud & DevOps, application modernization and customer experience. Founded in 2006, Grid Dynamics is headquartered in Silicon Valley with offices across the Americas, Europe, and India.

Technology

ITMAGINATION

Lead Site Reliability Engineer

Senior

Remote

Warsaw, Poland

20,150 - 21,700 PLN

🏢 Summary: Remote Lead Site Reliability Engineer role in a dynamic international tech project, focused on building, scaling, and operating modern cloud-based platforms. The position involves platform/system engineering, containerization, infrastructure as code, and multi-cloud deployments at scale. 🗂️ Requirements: 4+ years Platform/System Engineering experience, Proficiency in Go, Python, Java, or Ruby, 3+ years experience in engineering role within distributed teams, Hands-on experience with Docker and Kubernetes, Experience with GitHub Actions, ArgoCD, EKS, AKS, or ECS, Experience with Infrastructure as Code in multi-cloud environments, Experience building and operating cloud applications on AWS, Azure, or GCP, Experience with Chef, Degree in Computer Science, Computer Engineering, or related field (or equivalent experience) 📃 Skills: Go, Python, Java, Ruby, Docker, Kubernetes, GitHub, ArgoCD, EKS, AKS, ECS, AWS, Azure, GCP, Chef, Ansible, Terraform, IaC 🏢 Description: This is a remote position. Virtusa helps its Clients by becoming a true extension of their software and data development capabilities. Through readily set up, comprehensive, and self-governing teams, we let our Clients focus on their business while we make sure that their software products and data tools scale up accordingly and with outstanding quality. We are looking for a team player to fill the Lead Site Reliability Engineer position in a dynamic international project for a customer from the Tech area. Requirements 4+ years of hands-on Platform/System Engineering experience using Go, Python, Java, Ruby, or equivalent programming languages. 3+ years of experience in an engineering role, working with a diverse and distributed team located across the globe. Hands-on experience with containerization technologies (Docker, Kubernetes, GitHub Actions, ArgoCD, EKS, AKS, ECS). Exposure to Infrastructure as Code (IaC) with Multi-Cloud Deployments.
Proven experience building and reliably running modern full-stack cloud applications using public cloud technologies (AWS, Azure, GCP) at scale. Effective written and verbal communication skills to properly articulate complex technical problems to all levels of the organization and customers. Confidence in the ability to own and deliver a roadmap tied to business priorities. A passion for excellence, a natural problem solver, and a critical thinker who enjoys digging deep to understand issues and solve hard problems. Degree in Computer Science, Computer Engineering, or a related field (or equivalent experience). Experience with modern infrastructure management systems: Chef is a must-have (Ansible, Terraform). Nice to have: Expertise in building Platform-as-a-Service (PaaS) solutions. Benefits Professional training programs Work with a team that’s recognized for its excellence. We’ve been featured in the Deloitte Technology Fast 50 & FT 1000 rankings. We’ve also received the Great Place To Work® certification for five years in a row.

Technology

Sigma Software

Principal Site Reliability Engineer

Senior

Remote

Warsaw, Poland

🏢 Summary: Principal Site Reliability Engineer role leading infrastructure strategy for an AI-driven SaaS platform in the finance domain. The position focuses on scaling, securing, and optimizing cloud-based systems while driving automation, reliability, and performance. You will shape CI/CD, observability, and infrastructure practices in a high-growth environment.

🗂️ Requirements: 8+ years in Site Reliability Engineering or DevOps, 2+ years in a Principal or Lead role, Experience in infrastructure modernization and scaling, Strong proficiency in Python, Expertise in AWS cloud platforms, Experience with AWS ECS and EKS, Experience designing and optimizing CI/CD pipelines, Experience with Terraform for infrastructure-as-code, Strong knowledge of monitoring and observability practices

📃 Skills: Python, AWS, ECS, EKS, Terraform, GitHub, Buildkite, CI/CD, Monitoring, Observability

🏢 Description: Are you ready to lead infrastructure strategy for a cutting-edge AI-driven SaaS platform? We are looking for a Principal Site Reliability Engineer with a proven track record in scaling, optimizing, and securing cloud-based systems. This senior role offers the opportunity to shape the reliability and performance of a platform used by finance teams worldwide.

In this role, you will be part of a dynamic engineering environment where your expertise will directly influence product stability and growth. You will work with advanced cloud technologies, automation tools, and AI-driven solutions, contributing to projects that push the boundaries of innovation. If you are ready to take on strategic responsibility and make a tangible impact, apply now and join us in building the future of reliable, scalable systems.

Customer:

Sigma Software is partnering with a fast-growing AI-driven SaaS platform serving finance and accounting teams in high-growth businesses. The platform automates critical workflows, from billing and collections to revenue recognition and reporting, ensuring compliance and accelerating cash flow. Leveraging advanced AI, it reduces manual work, increases operational efficiency, and supports scalability for customers worldwide.

Project:

The project focuses on building and scaling an AI-powered SaaS solution for finance automation. It integrates advanced machine learning models with robust cloud infrastructure to deliver secure, compliant, and high-performance services. The engineering culture emphasizes automation, resilience, and operational excellence.

Requirements:

At least 8 years of experience in Site Reliability Engineering or DevOps roles, including 2+ years in a Principal or Lead position
Proven experience in infrastructure modernization and scaling initiatives for high-growth environments
Strong proficiency in Python
Deep expertise in cloud platforms and container orchestration tools such as AWS ECS and EKS
Solid experience in CI/CD pipeline design and optimization using tools like GitHub Actions and Buildkite
Proficiency in infrastructure-as-code tools such as Terraform
Strong knowledge of monitoring, observability, and performance optimization practices
Upper-Intermediate level of spoken and written English

Would be a plus:

Experience with monorepos (Turborepo, pnpm)
Familiarity with modern TypeScript tools (swc, biome, oxc)
Knowledge of NestJS, NextJS, and testing frameworks (Jest, Vitest)

Personal Profile:

Excellent leadership, communication, and decision-making abilities
Ability to work independently and make pragmatic build-vs-buy decisions in fast-paced environments

Responsibilities:

Define and lead infrastructure and reliability strategy across the platform
Design scalable, resilient systems in collaboration with engineering teams
Optimize build, testing, and deployment processes for speed and stability
Establish and uphold best practices for CI/CD, monitoring, and observability
Lead incident response and drive continuous improvement post-incident
Automate workflows to reduce operational toil and risk
Mentor engineers and foster a culture of operational excellence
Make strategic build-vs-buy decisions balancing speed, quality, and sustainability

Technology

GlobalTech Poland Sp z O. O.

Site Reliability Engineer

Mid

Hybrid

Warsaw, Poland

🏢 Summary: The offer is for a Site Reliability Engineer in the API Operations team, responsible for monitoring, diagnosing, and resolving production incidents across Apigee API implementations. The role focuses on ensuring stability, reliability, and performance of deployed APIs through observability, automation, and close collaboration with engineering and platform teams. It involves incident management, root cause analysis, and continuous improvement of production systems.

🗂️ Requirements: 3+ years in production support, SRE, or DevOps, Experience supporting Apigee APIs, Strong knowledge of cloud infrastructure (GCP preferred), Proficiency in Python or shell scripting, Proficiency in Python and JavaScript, Experience with observability and monitoring tools, Incident management and root cause analysis experience, Bachelor's degree in Computer Science, Engineering, or related field, English B2-C1, Czech or Polish B1-B2

📃 Skills: Apigee, GCP, Python, JavaScript, Shell, CI/CD, Monitoring, Observability, Logging, Automation, APIs

🏢 Description:

Summary of This Role

We are looking for a detail-oriented and technically strong Site Reliability Engineer to join our API Operations team. In this critical role, you will be responsible for monitoring, diagnosing, and resolving production incidents across our Apigee API implementations. You'll work closely with API engineering, Developer Services, Product Management, platform, and governance teams to ensure the stability, reliability, and performance of deployed models and agentic solutions across the enterprise. You will join a dynamic team passionate about learning, applying cutting-edge and cost-effective technologies, and innovating to deliver high-quality and highly available API solutions.

What Part Will You Play?

Serve as the first line of defense for production incidents, ensuring rapid triage, root cause analysis, and resolution.
Monitor system health and performance of deployed APIs and integrating applications.
Track and investigate issues related to latency, failures, or broken integrations, escalating to the API engineering group where appropriate.
Collaborate with platform engineers to implement observability, logging, and alerting best practices for API services.
Build diagnostic tools, runbooks, and automated workflows to improve incident response time and reduce manual intervention.
Maintain knowledge bases and playbooks for repeatable troubleshooting and knowledge transfer.
Partner with governance and compliance teams to ensure incidents are documented and remediated in line with internal policy.
Contribute to retrospectives and continuous improvement efforts to harden production systems.

What Are We Looking For in This Role?

Minimum Qualifications

3+ years of experience in production support, site reliability engineering (SRE), or DevOps, preferably supporting Apigee APIs.
Strong understanding of cloud infrastructure (GCP preferred) and observability tools.
Proficiency in Python or shell scripting for automation and troubleshooting.
Strong analytical, communication, and incident management skills.
Bachelor's degree in Computer Science, Engineering, or a related field.
Proficiency in programming languages such as Python and JavaScript.
Excellent problem-solving and analytical skills.
Excellent communication and collaboration skills.
English proficiency at B2-C1 level and Czech/Polish proficiency at B1-B2 level.

Preferred Qualifications

Experience with CI/CD tools and alerts/monitoring automation.
Familiarity with API integrations.

What Are Our Desired Skills and Capabilities?

Ability to work proactively with a high level of initiative and accuracy.
Ability to manage multiple assignments effectively and meet established deadlines.
Strong interpersonal skills to interact professionally with staff and stakeholders.
Excellent organizational skills and attention to detail.
Critical thinking ability for tasks ranging from moderately to highly complex.
Flexibility in adapting to changing business needs and priorities.
Ability to work creatively and independently with minimal supervision.
Ability to utilize experience and judgment in accomplishing goals.
Experience in navigating organizational structures and collaborating across teams.

What will you get from us:

working in a global environment with international market-focused projects, using English on a daily basis
private medical care
onboarding training in the first days of work: you will get to know our company better
training for employees: with us you will develop your professional and personal potential
lunch pass/Pluxee
multisport cards at preferential prices
possibility to join a group UNUM life insurance
fresh fruits every Wednesday and delicious coffee from Praska Palarnia every

Link Group

Link Group is a company operating in the technology industry, focusing on the development of high-quality web applications. The company emphasizes creating intuitive user interfaces and robust backend services, particularly using technologies like React JS and Kotlin. Link Group values collaboration, as evidenced by its emphasis on teamwork with architects, designers, and cross-functional teams. The company is committed to improving user experience and interface design through iterative development and feedback, highlighting its dedication to innovation and quality in its products.
