June 8, 2026

Site Reliability Engineer (SRE)

Senior • Hybrid

40,000 - 55,000 PLN

Warsaw, Poland

About the Client

Our client is a premier, global investment management firm operating at the intersection of finance and technology. Known for their sophisticated, data-intensive systems, they build and maintain high-performance platforms that process massive volumes of market and operational data.

To support their expanding footprint, they are looking for a senior-level Site Reliability Engineer (SRE) who will take ownership of shaping, standardizing, and scaling their SRE frameworks and reliability culture from the ground up.

The Role

In this role, you will serve as a foundational force for SRE practices, partnering directly with Cloud, Infrastructure, and Software Engineering squads. You will work across a hybrid infrastructure (combining advanced AWS cloud environments and physical on-premises servers) to guarantee the scalability, resilience, and maximum uptime of critical, high-frequency transactional platforms.

Core Responsibilities

SRE Evangelism: Design, implement, and champion core reliability principles, helping technology teams adopt sustainable scaling practices.
Observability Architecture: Implement, scale, and maintain end-to-end monitoring, telemetry, and distributed tracing systems using Prometheus, Grafana, Loki, and Tempo, with OpenTelemetry (OTel) instrumentation.
Kubernetes Optimization: Establish best-practice configurations for containerized workloads, ensuring applications running on Kubernetes are highly resilient, cost-effective, and performant.
Incident Management & Culture: Participate in a balanced, shared on-call rotation (averaging one week per month).
Automation & Engineering: Build custom tooling and CI/CD pipelines to automate routine tasks, system health checks, and rapid disaster recovery workflows.
SLO/SLA Definition: Partner with product and engineering teams to define, monitor, and enforce Service Level Objectives (SLOs) and Error Budgets.

What We Look For

Experience: 5+ years of hands-on experience in a dedicated SRE, DevOps, or Infrastructure Engineering role supporting complex, distributed production systems.
Education: A Bachelor’s degree in Computer Science, Computer Engineering, or a related technical discipline (or equivalent practical experience).
Observability Expertise: Deep subject-matter knowledge of modern monitoring stacks, specifically Grafana, Prometheus, Loki, and Tempo, along with OpenTelemetry (OTel).
Orchestration & Containers: Strong, production-grade expertise in containerization (Docker) and orchestration (Kubernetes).
Hybrid Infrastructure: Experience working in hybrid environments, managing both cloud services (AWS preferred) and physical on-premises hardware.
Scripting/Coding: Proficiency in writing clean, maintainable code in at least one scripting or programming language (e.g., Python, Bash, or Go) to build reliable automation.
Methodologies: Solid grounding in CI/CD concepts, infrastructure-as-code (IaC), and agile development processes.
Soft Skills: Excellent verbal and written communication skills, with a proven ability to convey complex infrastructure and reliability concepts to both technical and non-technical stakeholders.

What We Offer

Stable Employment: Full-time employment contract (Umowa o Pracę - UoP).
Tax Optimization: Eligibility for tax-deductible costs for creative work (KUP - Koszty Uzyskania Przychodu).
Financial Reward: Highly competitive base salary accompanied by a generous annual performance bonus.
Comprehensive Health: Premium private medical care package that fully includes dental coverage (stomatologia).
Wellness & Lifestyle: MultiSport card to keep you active and healthy.
Daily Perks: Pre-funded lunch card for your daily meals.

Tech Stack at a Glance

Cloud & Virtualization: AWS, Kubernetes, Docker, On-Premises Hypervisors
Observability: Prometheus, Grafana, Loki, Tempo, OpenTelemetry (OTel)
Languages: Python, Go, Bash
CI/CD & Automation: Git-based pipelines, Configuration Management, IaC

Similar jobs you might like

Technology

Link Group

Site Reliability Engineer

Mid

Hybrid

Warsaw, Poland

🏢 Summary: Hands-on Site Reliability Engineer role focused on building and scaling reliability practices across cloud and on-prem environments. The position involves improving performance, scalability, and resilience of production systems through automation, observability, and Kubernetes-based infrastructure. You will drive SRE standards and collaborate with engineering teams to enhance system stability and fault tolerance. 🗂️ Requirements: 4+ years experience in SRE, DevOps or similar roles, Strong experience with distributed systems, Strong experience with Kubernetes, Experience with AWS cloud, Hands-on automation experience with Python, Bash or Go, Solid understanding of CI/CD practices, Experience with observability and monitoring tools, Experience managing production systems 📃 Skills: Kubernetes, AWS, Python, Bash, Go, Prometheus, Grafana, CI/CD, SRE, DevOps 🏢 Description: We’re looking for a Site Reliability Engineer (SRE) to help build and scale reliability practices across our engineering organization. This is a hands-on role where you’ll work across cloud and on-prem environments, improving the performance, scalability, and resilience of critical production systems. 🔧 What you’ll be doing: • Driving SRE best practices, standards, and ways of working • Building and scaling observability & monitoring solutions (e.g. 
Prometheus, Grafana) • Working with Kubernetes-based infrastructure to ensure reliability and efficiency • Automating deployments, incident response, and recovery processes • Collaborating closely with engineering teams to improve system stability and fault tolerance • Contributing to a strong reliability culture (SLOs, post-mortems, continuous improvement) ✅ What we’re looking for: • 4+ years of experience in SRE / DevOps / similar roles • Strong experience with distributed systems, Kubernetes, and cloud (AWS preferred) • Hands-on approach to automation (Python, Bash, or Go) • Solid understanding of CI/CD and modern software delivery • Proactive mindset and strong ownership of production systems Name and surname*

Technology

emagine Polska

Senior DevOps / SRE (Platform Reliability Engineer) - French fluent

Senior

Remote

Lisbon, Portugal

🏢 Summary: Senior DevOps / SRE role focused on ensuring reliability, scalability, security, and performance of a cloud-native AWS platform. The position centers on infrastructure automation, CI/CD, Kubernetes operations, observability, and implementing SRE best practices to support highly available production systems. You will lead incident management, optimize cloud costs, and drive continuous improvement of platform resilience. 🗂️ Requirements: 5+ years in DevOps/SRE/Cloud/Platform Engineering, Strong Linux administration and troubleshooting, Production experience with Kubernetes, Experience with CI/CD tools, Expertise in Infrastructure as Code, Hands-on experience with AWS, Strong networking fundamentals, Experience with monitoring and logging tools, Scripting skills (Bash or Python) 📃 Skills: AWS, Kubernetes, Docker, Helm, Terraform, Ansible, CloudFormation, Linux, GitLab, Jenkins, GitHub, Azure, Prometheus, Grafana, ELK, Datadog, Splunk, Bash, Python, TCP/IP, DNS 🏢 Description: We are looking for a Senior DevOps / Site Reliability Engineer (SRE) to ensure the reliability, scalability, performance, and security of our platform and cloud infrastructure. You will play a key role in building and operating cloud-native systems, improving observability, automating operations, implementing SRE best practices (SLOs/SLIs), and supporting development teams to deliver highly available services. Key Responsibilities Design, implement, and maintain highly available and scalable infrastructure on AWS. Own and improve the reliability of production systems using SRE principles (SLO, SLI, error budgets). Build and manage CI/CD pipelines to support fast and safe software delivery. Develop and maintain Infrastructure as Code (IaC) using Terraform, Ansible, CloudFormation, etc. Manage and optimize container orchestration platforms (Kubernetes, Docker, Helm). Implement and maintain monitoring, logging, and alerting solutions (Prometheus, Grafana, ELK, Datadog, Splunk). 
Lead incident response, perform root cause analysis, and write postmortems to drive continuous improvement. Improve system performance, capacity planning, scaling strategies, and disaster recovery processes. Collaborate closely with development teams to improve deployment strategies and system resilience. Implement security best practices (IAM, secret management, vulnerability scanning, patching). Define operational standards, runbooks, documentation, and best practices for platform reliability. Participate in on-call rotation and provide senior-level support for critical production issues. Key Responsibilities (5 Main Missions) The DevOps / SRE lead will be responsible for the stability and evolution of the platform. Your role is structured around five main areas: Mission 1: AWS Infrastructure Management (Build & Run) Mission 2: CI/CD and Deployment Automation Mission 3: Monitoring, Observability, and Alerting: Global Monitoring , Log Management , Application Monitoring , Business Analytics Mission 4: Incident Management, Resilience, and Security Mission 5: FinOps and AWS Cost Optimization Key Requirements 5+ years of experience in DevOps / SRE / Cloud Infrastructure / Platform Engineering. Strong expertise in Linux systems administration and troubleshooting. Proven experience with Kubernetes in production environments. Strong experience with CI/CD tools (GitLab CI, Jenkins, GitHub Actions, Azure DevOps). Solid knowledge of Infrastructure as Code (Terraform highly preferred). Experience with AWS cloud platforms. Strong understanding of networking fundamentals (TCP/IP, DNS, load balancing, reverse proxies). Experience with observability tools: monitoring, metrics, logging, tracing. Strong scripting skills (Bash, Python, or similar). French advanced level. Nice to Have Experience with additional cloud platforms (Azure, GCP). Strong understanding of networking fundamentals.

Technology

EPAM Systems

Senior Site Reliability Engineer (SRE)

Senior

Remote

🏢 Summary: The offer is for a Site Reliability Engineer responsible for ensuring high reliability, scalability, and performance of cloud-based systems. The role focuses on implementing SRE practices, automating infrastructure, managing incidents, and enhancing monitoring and CI/CD processes. You will collaborate with cross-functional teams to optimize operations and maintain service excellence. 🗂️ Requirements: Bachelor’s degree in Computer Science, Engineering, or related field, 3+ years of experience in Site Reliability Engineering or similar role, Experience with cloud platforms (AWS, GCP, or Azure), Hands-on experience with SRE practices (SLO, SLI, error budgets, postmortems, toil reduction, capacity planning, incident management), Proficiency in Python or other scripting/programming language, Experience with monitoring tools, Experience with CI/CD tools, Experience with infrastructure as code, Experience with configuration management, Knowledge of Kubernetes and Docker, English proficiency B2 or higher 📃 Skills: AWS, GCP, Azure, Python, Kubernetes, Docker, CI/CD, Terraform, Ansible, Monitoring, SLO, SLI, Git, Bash 🏢 Description: We are seeking a highly skilled and motivated Site Reliability Engineer (SRE) to join our team. In this critical role, you will collaborate closely with software developers and operations teams to ensure high reliability, scalability, and efficiency of our systems, with a strong focus on meeting and exceeding customer expectations. Your expertise will be crucial in deploying, maintaining, and automating our infrastructure and application environments to ensure seamless user experiences. Your proactive involvement will be key to enhancing system reliability, optimizing resource utilization, and ensuring continuous improvement in our operational practices. Your responsibilities will include defining and tracking Service Level Objectives (SLOs), managing error budgets, and reducing toil through automation. 
You will play a pivotal role in driving the success of technology initiatives, maximizing their impact across the organization, and ensuring that solutions consistently meet the high standards our customers expect. Responsibilities Collaborate with development, security, quality, and operation teams to implement SRE practices and ensure system reliability Define and support required level of reliability, availability, and performance for services and applications Design and deliver Cloud-based solutions tailored to client needs Troubleshoot, mitigate, and support fixing of the infrastructure and application issues in a timely manner Implement a monitoring system for the infrastructure and application reliability Communicate technical concepts clearly to both engineering teams and management stakeholders Requirements Bachelor’s degree in Computer Science, Engineering, or a related field 3+ years of hands-on experience in Site Reliability Engineering or related roles Proven experience in any cloud (AWS/GCP/Azure) Experience with implementing SRE practices such as SLO/SLI, Error budgets, Postmortems, Reducing Toil, capacity planning, and Incident Management Python or other scripting/programming language Strong background in monitoring tools Proficiency in CI/CD tools, infrastructure as code, and configuration management Solid knowledge of container orchestration technologies (Kubernetes, Docker) English language proficiency at an Upper-Intermediate level (B2) or higher Nice to have Expertise in deployment and management of LLMs, including technologies like RAG Certification in Kubernetes, AWS/GCP/Azure, or similar technologies Proven experience in DevOps Knowledge of managing and optimizing AI/ML models in production environments, including basic deployment, monitoring, and maintenance We offer/Benefits We gather like-minded people: Engineering community of industry professionals Friendly team and enjoyable working environment Flexible schedule and opportunity to work 
remotely within Poland Chance to work abroad for up to 60 days annually Business-driven relocation opportunities We provide growth opportunities: Outstanding career roadmap Leadership development, career advising, soft skills, and well-being programs Certification (GCP, Azure, AWS) Unlimited access to LinkedIn Learning, Get Abstract, Cloud Guru English classes We cover it all: Stable income (Employment Contract or B2B) Participation in the Employee Stock Purchase Plan Benefits package (health insurance, multisport, shopping vouchers) Strategically located offices featuring entertainment and relaxation zones, table tennis and football, free snacks, fantastic coffee, and more Referral bonuses Corporate, social and well-being events Please, note: The set of bonuses might vary based on the role you apply for – specifics will be discussed with our recruiter during the general interview. We will reach out to selected candidates exclusively. EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.

Technology

Grid Dynamics Poland

Senior Site Reliability Engineer (SRE)

Senior

Hybrid

Warsaw, Poland

100 - 128 PLN

🏢 Summary: Senior Site Reliability Engineer role focused on ensuring reliability, performance, and resilience of enterprise products by bridging infrastructure and software engineering. The position involves hands-on Java/Spring Boot code fixes, Kubernetes-based container operations, incident response, and proactive architecture improvements. The engineer drives automation, observability, and security best practices across the SDLC. 🗂️ Requirements: 5+ years experience in SRE or Platform Engineering, Strong proficiency in Java, Strong proficiency in Spring Boot, Experience with Hibernate, Experience with Jenkins, Ability to read, analyze and fix application code, Hands-on experience with Docker, Hands-on experience with Kubernetes, Deep knowledge of Linux systems, Strong understanding of networking, Experience with distributed systems, Experience with monitoring and observability tools, Bachelor’s degree in Computer Science, Systems Engineering or equivalent experience 📃 Skills: Java, Spring, Hibernate, Jenkins, Docker, Kubernetes, Linux, Networking, Prometheus, Grafana, Splunk 🏢 Description: We are looking for an experienced Senior Site Reliability Engineer to join our team and oversee the reliability, resilience, and performance of our core enterprise products. In this role, you will bridge the gap between infrastructure operations and software engineering. You won't just react to alerts - you will proactively analyze system architecture, build automation, and dive deep into the application code (Java/Spring Boot) to fix bugs and eliminate issues at their root. Responsibilities: Architecture & Reliability: Understand the end-to-end product topology from both infrastructure and application perspectives. Identify bottlenecks, scale limitations, and unstable components, driving long-term resolutions before they impact production. 
Incident Response & RCA: Respond to outages, provide L3 on-call technical support (on rotation), and perform blameless Root Cause Analysis (RCA) to implement permanent fixes. Hands-on Engineering: Address defects, perform code bug fixes directly in production, and recommend architectural improvements during incident analysis. Security & Vulnerability Management: Oversee vulnerability management for applications and containers, manage patching processes, ensure compliance, and monitor certificate expirations and renewals according to global best practices. SRE Advocacy & SDLC: Represent the SRE organization in design reviews, capacity planning, and operational readiness exercises. Partner closely with development teams to embed reliability best practices early in the SDLC. Automation & Mentoring: Build automation tools to reduce manual toil and improve efficiency. Spread SRE culture, create standard documentation, and provide technical mentorship to junior team members. System Health: Oversee the production environment by tracking availability, applying learnings from observability tools, and becoming a Subject Matter Expert (SME) on core issuing products. Min requirements: Experience: 5+ years of experience in Site Reliability Engineering (SRE) or Platform Engineering roles. Software Engineering: Strong proficiency in Java, Spring Boot, Hibernate , and Jenkins. Ability to read, analyze, and fix application code. Containerization: Hands-on expertise with Docker and container orchestration using Kubernetes . Infrastructure: Deep knowledge of Linux systems, networking, and distributed architectures. Observability: Strong understanding of monitoring, logging, and observability tools (e.g., Prometheus, Grafana, Splunk). Education: Bachelor’s degree in Computer Science, Systems Engineering, or equivalent practical experience. Soft Skills: Excellent problem-solving abilities and strong communication skills. 
Would be a plus: Infrastructure as Code & Cloud: Hands-on experience with tools like Terraform or Ansible, alongside familiarity with major public cloud providers (AWS, GCP, or Azure). Advanced Networking & Service Mesh: Knowledge of service mesh technologies (e.g., Istio, Linkerd) for traffic management, security, and observability in microservices architectures. Industry Experience: Previous background in the FinTech, payments, or banking sectors, with an understanding of high-security compliance standards (e.g., PCI-DSS). We offer: Opportunity to work on bleeding-edge projects Work with a highly motivated and dedicated team Competitive salary Flexible schedule Benefits package - medical insurance, sports Corporate social events Professional development opportunities Well-equipped office About us: Grid Dynamics (NASDAQ: GDYN) is a leading provider of technology consulting, platform and product engineering, AI, and advanced analytics services. Fusing technical vision with business acumen, we solve the most pressing technical challenges and enable positive business outcomes for enterprise companies undergoing business transformation. A key differentiator for Grid Dynamics is our 8 years of experience and leadership in enterprise AI , supported by profound expertise and ongoing investment in data , analytics , cloud & DevOps , application modernization and customer experience . Founded in 2006, Grid Dynamics is headquartered in Silicon Valley with offices across the Americas, Europe, and India.

Technology

Grid Dynamics Poland

Site Reliability Engineer

Senior

Hybrid

Warsaw, Poland

🏢 Summary: Site Reliability Engineer role focused on leading the cloud platform layer of a large-scale enterprise migration to GCP, with full ownership of observability and FinOps capabilities. The position involves architecting cost attribution, distributed tracing, monitoring, and performance engineering solutions in a production-grade Kubernetes environment. You will work on complex distributed systems, extending multi-language codebases and managing infrastructure as code in a regulated enterprise setting. 🗂️ Requirements: 4–6 years software or DevOps engineering experience, 2–3 years hands-on cloud infrastructure management in production, Strong GCP expertise including GKE and Cloud Run, Proven experience building observability solutions with OpenTelemetry, Experience with distributed tracing and profiling in distributed systems, Advanced Python scripting for automation and tooling, Strong Terraform proficiency with multi-environment setups, Ability to read and modify Kotlin and Java codebases, Experience implementing monitoring, alerting, and SLOs for containerized/serverless services, Experience with infrastructure cost attribution and cloud billing APIs 📃 Skills: GCP, GKE, CloudRun, Kubernetes, OpenTelemetry, Terraform, Python, Kotlin, Java, FinOps, PubSub, Bigtable, Docker, SLO, Tracing 🏢 Description: We are looking for a Site Reliability Engineer to join a high-stakes global tech ecosystem and drive the delivery of a critical enterprise platform migration to the cloud. Your core mission will be to architect, build, and productionalize the observability and cost intelligence (FinOps) layer for a massive, multi-year financial platform transformation. You will take end-to-end ownership of the cloud platform layer, giving internal stakeholders full visibility into platform behavior, performance, and infrastructure spend. 
Working alongside a nearshore team of senior engineers, you will solve highly complex architectural challenges in a production-grade, distributed system. Responsibilities: End-to-End Infrastructure & FinOps Ownership: Architect and implement a cloud usage and cost attribution dashboard, providing detailed per-pod and per-service cost breakdown using cloud billing APIs and internal FinOps hubs. Advanced Observability & Tracing: Instrument end-to-end distributed tracing using OpenTelemetry, configuring collectors within Kubernetes environments and exporting traces to cloud monitoring systems utilizing RED metrics. Performance Engineering & Stress Testing: Write custom tooling from scratch to deliver database performance monitoring, load testing, and trend analysis for critical underlying storage layers. Monitoring & Alerting Automation: Build and deploy scalable production monitoring, custom alerting policies, and SLO tracking for containerized and serverless services. Infrastructure as Code: Independently manage, write, and apply infrastructure modifications using Terraform, working within established enterprise repository standards, modules, and environment state management. Cross-Language Codebase Extension: Read, debug, and extend existing platform code across a diverse stack including Kotlin, Java, and Python to seamlessly integrate technical metrics without disrupting business logic. Quality & Release Assurance: Implement rigorous unit testing with high code coverage for all newly developed monitoring tools to comply with strict enterprise quality gates and sign-offs. Min requirements: Experience: 4 to 6 years of professional software or DevOps engineering experience, with at least 2 to 3 years of hands-on cloud infrastructure management in production. Advanced Cloud Infrastructure: Deep operational proficiency with Google Cloud Platform (GCP), specifically with managing and configuring workload-level alerting on Google Kubernetes Engine (GKE) and Cloud Run. 
Observability & OpenTelemetry: Proven track record of building observability solutions in distributed systems, using OpenTelemetry (both auto and manual instrumentation) alongside distributed tracing and profiling tools. Strong Automation Scripting: Intermediate-to-advanced fluency in Python for writing custom test tooling, metrics integration scripts, and backend automation from scratch. Solid Infrastructure as Code: Strong proficiency in Terraform, including experience with multi-environment setups, workspaces, and corporate module standards. Polyglot & JVM Familiarity: Practical ability to read, understand, and modify existing backend codebases written in Kotlin and Java. Crucial Non-Technical Skills: Extreme technical autonomy to resolve blockers independently, rapid onboarding skills into large unfamiliar codebases, and fluent written English for async alignment and pull requests. Process Alignment: Ability to thrive in a highly regulated enterprise environment with strict peer reviews, robust documentation requirements, and formal deployment procedures. Would be a plus: Domain Knowledge: Previous experience working within financial services, fintech, investment banking, or other highly regulated industries. Enterprise Streaming Tools: Working knowledge of cloud messaging systems (such as Cloud Pub/Sub) utilized for inter-service communication. Advanced Storage Engines: Familiarity with high-throughput distributed database architectures, such as Google Cloud Bigtable. Systems Languages Awareness: Ability to read or debug foundational code written in low-level systems languages like Rust or C++ during multi-stack production deployments. 
We offer: Opportunity to work on bleeding-edge projects Work with a highly motivated and dedicated team Competitive salary Flexible schedule Benefits package - medical insurance, sports Corporate social events Professional development opportunities Well-equipped office About us: Grid Dynamics (NASDAQ: GDYN) is a leading provider of technology consulting, platform and product engineering, AI, and advanced analytics services. Fusing technical vision with business acumen, we solve the most pressing technical challenges and enable positive business outcomes for enterprise companies undergoing business transformation. A key differentiator for Grid Dynamics is our 8 years of experience and leadership in enterprise AI , supported by profound expertise and ongoing investment in data , analytics , cloud & DevOps , application modernization and customer experience . Founded in 2006, Grid Dynamics is headquartered in Silicon Valley with offices across the Americas, Europe, and India.

Technology

Link Group

Senior Site Reliability Engineer

Senior

Hybrid

Warsaw, Poland

170 - 230 PLN

🏢 Summary: The role focuses on ensuring reliability, scalability, and performance of large-scale cloud-based applications by building and maintaining resilient infrastructure. You will manage AWS cloud environments, Kubernetes clusters, and CI/CD pipelines while implementing monitoring, automation, and incident response processes. The position emphasizes Infrastructure-as-Code, observability, and continuous reliability improvements. 🗂️ Requirements: 5+ years experience in SRE, DevOps or similar role, Strong experience with AWS cloud services, Experience with Infrastructure-as-Code tools, Hands-on experience with Kubernetes, Proficiency with Docker, Experience with CI/CD pipelines, Solid knowledge of PostgreSQL or Amazon RDS, Strong SQL knowledge, Knowledge of networking concepts (VPC, DNS, troubleshooting), Strong Linux/Unix administration skills, Experience with observability tools, Experience with automation in infrastructure, Experience with incident management 📃 Skills: AWS, Terraform, Pulumi, Kubernetes, EKS, Docker, GitHub, PostgreSQL, RDS, SQL, VPC, DNS, Linux, Unix, Prometheus, Grafana, Datadog, Dynatrace, CI/CD 🏢 Description: We are looking for an experienced Site Reliability Engineer to ensure the reliability, scalability, and performance of large-scale cloud-based web applications. You will work closely with software development, cloud operations, and platform teams to build and maintain resilient infrastructure and improve system stability. 
Key Responsibilities: Design and maintain monitoring, alerting, and incident response systems to ensure high availability Collaborate closely with engineering, product, and architecture teams Build and manage cloud infrastructure using Infrastructure-as-Code (e.g., Terraform, Pulumi) on AWS Operate and optimize Kubernetes environments (e.g., EKS) Develop and maintain containerized applications using Docker Improve CI/CD pipelines and drive automation across deployment processes Implement and manage observability tools (logging, metrics, tracing) Participate in incident management, postmortems, and reliability improvements Support capacity planning, disaster recovery, and system scaling Contribute to security, compliance, and operational best practices Develop automation and AI-driven solutions for monitoring and incident prevention Requirements: 5+ years of experience in SRE, DevOps, or similar roles Strong experience with AWS cloud services and Infrastructure-as-Code tools Hands-on experience with Kubernetes and containerized environments Proficiency in Docker and CI/CD pipelines (e.g., GitHub Actions) Solid understanding of databases (e.g., PostgreSQL, Amazon RDS) and SQL Knowledge of networking concepts (VPC, DNS, troubleshooting tools like dig/traceroute) Strong Linux/Unix administration skills Experience with observability tools (e.g., Prometheus, Grafana, Datadog, Dynatrace) Familiarity with automation and AI-based solutions in infrastructure Strong problem-solving and incident management skills

Technology

DCV Technologies

Java SRE Engineer

Senior

Hybrid

Warsaw, Poland

900 - 1,100 PLN

🏢 Summary

The offer is for a Java SRE Engineer responsible for ensuring reliability, scalability, and performance of mission-critical fintech systems in a cloud-native environment. The role combines Java development with Site Reliability Engineering practices, focusing on automation, observability, and production support. You will work with modern cloud, container, and AI-powered tools to improve operational excellence and system resilience.

🗂️ Requirements

5+ years of experience in Java development or Site Reliability Engineering
Strong knowledge of Java, Spring Boot, REST APIs, Microservices
Experience managing production systems in high-availability environments
Strong understanding of monitoring, observability, incident management, root cause analysis, performance tuning, reliability engineering
Experience with Prometheus, Grafana, ELK or OpenSearch
Hands-on experience with Kubernetes and Docker
Experience with at least one cloud platform: AWS, Azure or GCP
Experience with Jenkins and CI/CD pipeline development
Experience with infrastructure automation and scripting
Experience with PostgreSQL, MySQL or Oracle
2+ years of experience in fintech, payments or financial services
Knowledge of PCI DSS and security/compliance standards

📃 Skills

Java, Spring, SpringBoot, REST, Microservices, SRE, Prometheus, Grafana, ELK, OpenSearch, Splunk, Kubernetes, Docker, AWS, Azure, GCP, Jenkins, CI/CD, PostgreSQL, MySQL, Oracle, Terraform, Ansible, Linux, PCI-DSS

🏢 Description

🚀 Java SRE Engineer
📍 Hybrid | 2 days from the office in Warsaw

Are you passionate about reliability, automation, cloud technologies, and building resilient systems at scale? We're looking for a Java SRE Engineer to join our growing team and help ensure the stability, security, and performance of mission-critical applications within the fintech and payments domain.

As part of our engineering organization, you'll combine software engineering expertise with Site Reliability Engineering practices to improve platform reliability, automate operations, and drive operational excellence. You'll also leverage modern AI-powered tools to enhance troubleshooting, automation, and engineering productivity.

💡 Your Mission

⚙️ Site Reliability Engineering

Ensure the availability, reliability, scalability, and performance of critical production systems.
Design and implement monitoring, alerting, and observability solutions.
Manage incident response, root cause analysis, and post-mortem activities.
Establish and monitor Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs).
Drive reliability improvements through automation and engineering best practices.

☕ Software Engineering

Develop and maintain automation tools and platform services using Java.
Improve system resilience, performance, and operational efficiency.
Collaborate with development teams to build highly reliable applications.
Support production environments and participate in on-call rotations when required.

☁️ Cloud & Infrastructure

Design and maintain cloud-native infrastructure.
Manage containerized environments using Docker and Kubernetes.
Implement Infrastructure as Code (IaC) practices.
Optimize platform scalability, security, and cost efficiency.

🤖 AI-Powered Engineering

Utilize tools such as GitHub Copilot, ChatGPT, Claude, and Cline to improve automation and operational efficiency.
Explore AI-driven approaches to incident management, monitoring, and system optimization.
Promote best practices for AI adoption within engineering teams.

🔐 Security & Compliance

Support implementation of security controls and operational best practices.
Assist with compliance requirements and operational audits.
Ensure secure handling of sensitive financial and customer data.

🤝 Collaboration

Partner closely with Software Engineers, DevOps Engineers, Security Teams, and Product Owners.
Participate in Agile ceremonies and reliability planning sessions.
Contribute to technical documentation and knowledge-sharing initiatives.

🎯 Must-Have Skills

Backend & Automation

✔️ 5+ years of experience in Java development or Site Reliability Engineering
✔️ Strong knowledge of: Java, Spring Boot, REST APIs, Microservices architecture

SRE & Operations

✔️ Experience managing production systems in high-availability environments
✔️ Strong understanding of: monitoring and observability, incident management, root cause analysis, performance tuning, reliability engineering principles
✔️ Experience with: Prometheus, Grafana, ELK Stack or OpenSearch, Splunk (nice to have)

Cloud & Containers

✔️ Hands-on experience with: Kubernetes, Docker
✔️ Experience with cloud platforms: AWS, Azure, GCP

CI/CD & Automation

✔️ Jenkins
✔️ CI/CD pipeline development and maintenance
✔️ Infrastructure automation and scripting

Databases

✔️ Experience with: PostgreSQL, MySQL, Oracle

⭐ Nice to Have

Experience with Terraform, Ansible, or Infrastructure as Code tools
Experience supporting fintech or payment platforms
Knowledge of Linux system administration
Experience with distributed systems and event-driven architectures
Open-source contributions
Technical blogging or community involvement

💳 FinTech & Payments Experience

2+ years of experience in fintech, payments, or financial services
Knowledge of PCI DSS requirements
Familiarity with payment processing systems and financial transaction platforms
Understanding of security and compliance standards within regulated industries

🎁 What You'll Get

✨ Opportunity to work on large-scale, business-critical systems
✨ Modern cloud-native technology stack
✨ Exposure to AI-powered engineering and automation tools
✨ Hybrid work model (Gdańsk or Warsaw)
✨ Collaborative engineering culture focused on innovation and reliability
✨ Influence on platform architecture and operational excellence
✨ Continuous learning and career development opportunities

🚀 Ready to build reliable, scalable systems that power the future of fintech? Join us and help create highly available platforms where engineering excellence, automation, and innovation come first.

Technology

AgileEngine

Site Reliability Engineer ID60188

Mid

Remote

Krakow, Poland

4,300 - 7,700 USD

🏢 Summary

Hands-on SRE Operations Engineer role focused on ensuring reliability and performance of a cloud-based SaaS platform. The position involves managing Kubernetes infrastructure, improving observability, supporting CI/CD and GitOps workflows, and automating operational processes in AWS environments. Includes active participation in incident response and on-call rotations.

🗂️ Requirements

2+ years in SRE, DevOps, or Production Operations
Experience with AWS in production environments
Experience supporting production SaaS applications
Strong knowledge of CI/CD systems
Experience with GitOps and Git
Experience with Kubernetes (EKS or kOps)
Experience with Docker and containerization
Experience with observability and monitoring tools
Experience with scripting (Bash, Python, or Go)
Experience with Infrastructure as Code (Terraform or Helm)
Experience using GitHub, Jira, and Confluence

📃 Skills

AWS, Kubernetes, EKS, kOps, Docker, Terraform, Helm, Grafana, Prometheus, Loki, PagerDuty, Git, GitHub, Jenkins, CircleCI, GitHubActions, CICD, Bash, Python, Go, Jira, Confluence, SRE, SaaS

🏢 Description

AgileEngine is an Inc. 5000 company that creates award-winning software for Fortune 500 brands and trailblazing startups across 17+ industries. We rank among the leaders in areas like application development and AI/ML, and our people-first culture has earned us multiple Best Place to Work awards.

Why join us

If you're looking for a place to grow, make an impact, and work with people who care, we'd love to meet you! :)

About the role

We are looking for an SRE Operations Engineer to keep production and staging environments running reliably across a cloud-based SaaS platform. You’ll respond to live incidents, reduce operational toil through automation, and improve observability using Kubernetes, Terraform, Grafana, and AWS. A hands-on role with real ownership across CI/CD pipelines, GitOps workflows, and on-call rotations.

What you will do

Monitor and support production and staging environments in real time, ensuring high availability, performance, and stability
Respond to incidents, perform triage and root cause analysis, and contribute to post-incident reviews and remediation efforts
Participate in an on-call rotation with defined SLAs
Handle ad-hoc and unplanned operational requests from Product, Support, and internal teams
Maintain and enhance monitoring, alerting, dashboards, logs, and metrics, and improve observability practices
Support CI/CD pipelines, production releases, and GitOps workflows
Contribute to automation efforts to reduce operational toil
Maintain and improve Kubernetes-based infrastructure and containerized workloads
Support Infrastructure as Code practices and ongoing environment improvements

Must haves

2+ years of experience in Site Reliability Engineering, DevOps, or Production Operations
Experience with AWS supporting production environments
Experience supporting production SaaS applications
Strong understanding of CI/CD systems such as GitHub Actions, Jenkins, or CircleCI
Experience with GitOps and strong Git fundamentals
Experience using GitHub, Jira, and Confluence in collaborative environments
Experience with Kubernetes such as EKS or kOps
Experience with Docker and containerization
Experience with observability tools such as Grafana, Prometheus, Loki, or PagerDuty
Experience with scripting languages such as Bash, Python, or Go
Experience with Infrastructure as Code such as Terraform or Helm
Ability to work within structured operational processes and SLAs
Strong written and verbal English communication skills
Self-driven with a growth mindset

Nice to haves

AWS certifications such as Solutions Architect, DevOps Engineer, or SysOps Administrator
Experience in multi-tenant SaaS environments
Experience working in globally distributed teams
Familiarity with ChatOps practices
Experience improving monitoring quality and reducing alert fatigue

The benefits of joining us

Professional growth: Accelerate your professional journey with mentorship, TechTalks, and personalized growth roadmaps
Competitive compensation: We match your ever-growing skills, talent, and contributions with competitive USD-based compensation and budgets for education, fitness, and team activities
A selection of exciting projects: Join projects with modern solutions development and top-tier clients that include Fortune 500 enterprises and leading product brands
Flextime: Tailor your schedule for an optimal work-life balance, by having the options of working from home and going to the office – whatever makes you the happiest and most productive.

Meet Our Recruitment Process

Asynchronous stage – An automated, self-paced track that helps us move faster and give you quicker feedback:
Short online form to confirm basic requirements
30–60 minute skills assessment via Codility – a platform founded in Poland that helps us provide quicker feedback and streamline this stage of the process
5-minute introduction video

Synchronous stage – Live interviews:
Technical interview with our engineering team (scheduled at your convenience)
Final interview with your future teammates

If it’s a match, you’ll get an offer!

Technology

Link Group

DevOps / Site Reliability Engineer

Mid

Hybrid

Kraków, Poland

20,000 - 25,000 PLN

🏢 Summary

DevOps / Site Reliability Engineer role focused on building and maintaining scalable cloud infrastructure while improving platform reliability and automation. The position centers on Kubernetes-based environments, CI/CD pipeline development, and enhancing monitoring and observability. The engineer will support development teams through infrastructure as code and internal developer platform initiatives.

🗂️ Requirements

Experience with cloud platforms (Azure preferred)
Strong experience with Kubernetes
Strong knowledge of Infrastructure as Code (Terraform)
Hands-on experience with CI/CD tools
Experience with monitoring and observability tools
Understanding of scalability, reliability, and security best practices

📃 Skills

Azure, Kubernetes, Terraform, GitHubActions, ArgoCD, CI/CD, Datadog, Prometheus, Grafana, MongoDB, Rancher, Jenkins, PowerBI, Jira, Confluence

🏢 Description

DevOps / Site Reliability Engineer

We’re looking for a DevOps / SRE to help build and maintain scalable cloud infrastructure and improve reliability across our platform. You’ll focus on automation, CI/CD, and supporting development teams with efficient tooling and processes.

Key responsibilities

Develop and manage cloud infrastructure (Azure preferred)
Work with Kubernetes and containerized environments
Build and maintain CI/CD pipelines (GitHub Actions, ArgoCD)
Automate deployments and operational processes
Contribute to Internal Developer Platform (IDP) development
Improve monitoring and observability (e.g., Datadog, Prometheus, Grafana)

Requirements

Experience with cloud platforms and Kubernetes
Strong knowledge of Infrastructure as Code (e.g., Terraform)
Hands-on experience with CI/CD tools
Understanding of scalability, reliability, and security best practices
Experience with monitoring/observability tools

Nice to have

Experience with MongoDB Atlas, Rancher, Jenkins, Power BI
Familiarity with Jira, Confluence

Technology

Caspian One

Site Reliability Engineer

Senior

Hybrid

Krakow, Poland

1,400 - 1,800 PLN

🏢 Summary

Hands-on Site Reliability Engineer role focused on ensuring stability, scalability, and observability of a mission-critical distributed risk and analytics platform in hybrid cloud environments. The position centers on production reliability, incident response, automation, and continuous improvement of monitoring and deployment processes. You will collaborate with engineering teams to strengthen system resilience, performance, and operational standards.

🗂️ Requirements

Strong Java experience in distributed systems
Experience with observability and monitoring tools
Hands-on experience with hybrid cloud environments (preferably GCP)
Experience with CI/CD pipelines and automation tools
Solid knowledge of Linux systems administration
Understanding of RDBMS fundamentals
Experience with job schedulers (e.g., Control-M)
Ability to lead incident response and root-cause analysis

📃 Skills

Java, Grafana, Prometheus, Loki, OpenTelemetry, GCP, Jenkins, Ansible, Linux, SQL, Control-M, CI/CD

🏢 Description

We’re looking for a seasoned Site Reliability Engineer to support a high-performance, mission-critical risk and analytics platform used across global trading and finance environments. You’ll play a key role in ensuring the stability, scalability, and observability of complex distributed systems running across hybrid cloud infrastructure.

In this role, you’ll take ownership of production reliability, driving incident response, conducting root-cause analysis, improving monitoring capabilities, and delivering automation that reduces operational toil. You’ll work closely with development teams, platform engineers, and service management leads to strengthen resilience, refine processes, and enhance the engineering culture around availability and performance.

This is a hands-on technical position suited to someone who thrives in high-throughput environments, communicates clearly, and enjoys solving deep engineering problems in real time.

Core Responsibilities

Maintain and improve the reliability, uptime, and performance of distributed applications.
Lead incident response, triage complex issues, coordinate recoveries, and deliver structured post-incident reviews.
Enhance observability by designing and evolving monitoring, alerting, logging, and tracing frameworks.
Drive continuous improvement across automation, deployment processes, and service stability.
Collaborate with cross-functional teams to influence architecture, design, and operational standards.
Support CI/CD pipelines, environment configuration, and vulnerability remediation.
Contribute to a knowledge-driven culture through documentation, tooling, and best-practice adoption.

Required Skills & Experience

Strong Java background with proven experience supporting or developing distributed systems.
Observability tooling expertise (Grafana, Prometheus, Loki, OpenTelemetry or similar).
Hands-on with hybrid cloud environments, ideally with GCP or another major cloud provider.
CI/CD and automation experience (e.g., Jenkins, Ansible).
Solid understanding of Linux, RDBMS fundamentals, and job schedulers (e.g., Control-M or equivalents).
Strong analytical mindset with a methodical approach to troubleshooting.
Excellent communication skills and comfort working in Agile teams.

Yard Corporate

Yard Corporate is a global cybersecurity startup that focuses on revolutionizing security data management through a scientific approach utilizing AI and streaming data. The company aims to transform how organizations handle security data by separating it from compliance, which allows for real-time threat detection at reduced costs. Yard Corporate serves a clientele that includes Fortune 500 companies across the financial, healthcare, and insurance sectors. As an early-stage startup, it offers significant opportunities for foundational engineering and innovation in the cybersecurity industry.
