June 8, 2026

Site Reliability Engineer

Senior • Remote

140 - 180 PLN

Warsaw, Poland

SRE (f/m)

Location: 100% remote
Start: April/May 2026
Contract type: B2B

Join a team working on global streaming infrastructure!

We are looking for an experienced Site Reliability Engineer (SRE) to support the development and maintenance of international CDN platforms, both on-prem and in the cloud. In this role you will have a real impact on the reliability and quality of OTT services delivered to users across multiple continents.

Your role

As part of the CDN Operations team, you will be responsible for the reliability, scalability, and operational excellence of the CDN platforms, for process automation, and for developing tools that support monitoring and analysis.

Responsibilities

CDN Reliability & Operations

Ensuring the availability, resilience, and high performance of CDN platforms (cloud, bare metal, international networks, IXs, ISP-side caches).
Regular analysis of CDN capacity, performance, and traffic forecasts.
Participating in deployments, production rollouts, and analysis of OTT consumption across multiple regions.
Monitoring key indicators (latency, throughput, cache hit ratio, error rate) and implementing optimizations.
Participating in incident handling, RCA, and the implementation of reliability improvement plans.
Occasional support for DevOps teams with operational tasks.

Observability & Monitoring

Building and maintaining the observability layer for all CDN environments (Datadog).
Creating and maintaining standardized dashboards, alerts, SLOs/SLAs, and log pipelines.
Designing scalable monitoring solutions capable of handling large traffic volumes.
Implementing automated health checks, anomaly detection, and alerting workflows.
Improving data collection and visualization processes for technical and business teams.

Development of Tools & Automation

Writing scripts and workflows (Python/Bash/API) for collecting metrics and analyzing costs and operational data.
Building internal tools for:
- log analysis,
- audience and traffic visualization,
- CDN configuration validation,
- diagnostics and troubleshooting,
- cache testing.
Supporting automation based on Terraform, CI/CD, and automated configuration rollouts.

Collaboration & CDN Governance

Collaborating with OTT engineering, DevOps, Network, Security, and Data teams and with international units.
Creating and developing global standards (latency, TTL, caching, observability, security, costs).
Sharing knowledge with teams across multiple regions (Europe, Africa, Asia).
Preparing technical documentation and onboarding materials.
Working with ISPs, cloud providers, and operations teams to resolve distribution issues.
Support during major events (sports, live, peak traffic): preparation, monitoring, and post-event analysis.

Requirements

Experience and education

University degree in a technical field (Computer Science, Networking/Telecommunications).
At least 4–5 years of experience in SysOps / DevOps / SRE roles.

Technical skills

Solid networking fundamentals: DNS, TCP, HTTP, routing (BGP), caching, proxy.
Knowledge of tools: Terraform, Ansible, AWS Lambda, GitLab CI/CD.
Very good knowledge of Unix/Linux systems.
Experience with monitoring tools (Datadog, Grafana).
Knowledge of CDN/OTT and QoS topics is a plus.

Soft skills

Analytical thinking, independence, and good work organization.
Ability to collaborate with technical and non-technical teams.
Fluent English; French is a plus.

Motivation

Willingness to work on large-scale, high-availability systems.
Interest in automation, observability, and performance engineering.

Terms of cooperation

Start: April/May 2026
Contract: B2B
Goal: long-term cooperation
Work mode: 100% remote
Benefits: Multisport card and Luxmed

What do you gain by applying to an Antal offer?

If your application is reviewed positively (you are invited to the process), you will receive support from a Consultant who will stay in regular contact with you (by email or phone), help you prepare for the interview with your future employer, and look after the quality of the recruitment process you are taking part in.

Who are we?

We are a leader in the recruitment of specialists and managers and in HR consulting. The brand is present in 35 countries and has operated in Poland since 1996. Over that time we have built the careers of many candidates, thanks to a flexible and comprehensive approach to every recruitment. Antal is made up of more than 130 professional recruitment consultants who are not only effective recruiters but also qualified advisors, specializing both in a chosen sector and in a specific position.

Similar jobs you might like

Technology

Antal Sp. z o.o.

Senior Application Reliability Engineer

Senior

Hybrid

Krakow, Poland

🏢 Summary: Site Reliability Engineer role in a global financial environment focused on ensuring 24/7 reliability, automation, and scalability of critical production systems. The position involves incident management, architectural input, observability development, and continuous platform improvement within an international DevOps team. Hybrid work model with rotational on-call duties. 🗂️ Requirements: Minimum 7 years of experience in SRE or production application support, Experience maintaining 24/7 production systems, Strong troubleshooting and incident management skills, Experience with Ansible, Jenkins, Prometheus, Grafana, Programming skills in Java, Python, or JavaScript, Experience with Node.js or React, Knowledge of SQL, Practical knowledge of SDLC, Experience defining and monitoring SLI/SLO, Experience with migrations, upgrades, and disaster recovery, Willingness to participate in on-call rotation 📃 Skills: SRE, Ansible, Jenkins, Prometheus, Grafana, Java, Python, JavaScript, Node.js, React, SQL, SDLC 🏢 Description: Site Reliability Engineer (SRE) Kraków (hybryda – 6 dni/miesiąc z biura) O projekcie Nasz Klient wspiera globalną organizację finansową przy rozwoju i utrzymaniu krytycznych systemów działających 24/7. To rola w międzynarodowym zespole DevOps, gdzie niezawodność, automatyzacja i skalowalność są kluczowe. Będziesz mieć realny wpływ na stabilność usług, decyzje architektoniczne oraz kierunek rozwoju platform technologicznych. 
Twoja rola Zapewnienie wysokiej dostępności i niezawodności systemów produkcyjnych (24/7) Wdrażanie rozwiązań zgodnych z praktykami SRE (monitoring, automatyzacja, optymalizacja) Analiza i rozwiązywanie incydentów + root cause analysis Udział w projektowaniu architektury systemów Definiowanie i monitorowanie SLI/SLO oraz rozwój observability Planowanie i realizacja migracji, upgrade’ów oraz testów disaster recovery Automatyzacja procesów i rozwój self-service dla użytkowników Wsparcie użytkowników i ciągłe ulepszanie doświadczenia końcowego Udział w dyżurach on-call (rotacyjnie) Udział w zaplanowanych pracach utrzymaniowych Wymagania Min. 7 lat doświadczenia w SRE lub wsparciu aplikacji produkcyjnych Bardzo dobre umiejętności troubleshootingu i pracy pod presją Doświadczenie z narzędziami: Ansible Jenkins Prometheus Grafana Umiejętności programistyczne (full-stack), np.: Java / Python / JavaScript Node.js / React SQL Praktyczna znajomość SDLC Bardzo dobre umiejętności komunikacyjne i doświadczenie w pracy w środowisku międzynarodowym Mile widziane Doświadczenie z Jira i Confluence (Data Center) Szybka adaptacja do nowych technologii i środowisk Co zyskasz dzięki aplikacji na ofertę Antal? Gdy Twoja aplikacja zostanie rozpatrzona pozytywnie (zostaniesz zaproszony/a do procesu), otrzymasz wsparcie Konsultanta/Konsultantki, który/a utrzyma z Tobą stały kontakt (mailowo lub telefonicznie), pomoże Ci przygotować się do rozmowy rekrutacyjnej z przyszłym pracodawcą oraz zatroszczy się o jakość procesu rekrutacyjnego, w którym aktualnie bierzesz udział. Kim jesteśmy? Jesteśmy liderem rekrutacji specjalistów i menedżerów oraz doradztwa w obszarze HR. Marka obecna jest w 35 krajach, w Polsce działa od 1996 roku. Przez ten czas zbudowaliśmy wiele karier kandydatów, dzięki elastycznemu i kompleksowemu podejściu do wszystkich rekrutacji. Antal tworzy ponad 130 profesjonalnych konsultantów ds. 
rekrutacji, którzy są oni nie tylko skutecznymi rekruterami, ale także wykwalifikowanymi doradcami, specjalizującymi się zarówno w zakresie wybranego sektora, jak i stanowiska. Sprawdź inne aktualne oferty pracy na: https://antal.pl/dla-kandydata Zaobserwuj nasz profil na LinkedIn: https://www.linkedin.com/company/antalpoland

Technology

EPAM Systems

Senior Site Reliability Engineer (SRE)

Senior

Remote

🏢 Summary: The offer is for a Site Reliability Engineer responsible for ensuring high reliability, scalability, and performance of cloud-based systems. The role focuses on implementing SRE practices, automating infrastructure, managing incidents, and enhancing monitoring and CI/CD processes. You will collaborate with cross-functional teams to optimize operations and maintain service excellence. 🗂️ Requirements: Bachelor’s degree in Computer Science, Engineering, or related field, 3+ years of experience in Site Reliability Engineering or similar role, Experience with cloud platforms (AWS, GCP, or Azure), Hands-on experience with SRE practices (SLO, SLI, error budgets, postmortems, toil reduction, capacity planning, incident management), Proficiency in Python or other scripting/programming language, Experience with monitoring tools, Experience with CI/CD tools, Experience with infrastructure as code, Experience with configuration management, Knowledge of Kubernetes and Docker, English proficiency B2 or higher 📃 Skills: AWS, GCP, Azure, Python, Kubernetes, Docker, CI/CD, Terraform, Ansible, Monitoring, SLO, SLI, Git, Bash 🏢 Description: We are seeking a highly skilled and motivated Site Reliability Engineer (SRE) to join our team. In this critical role, you will collaborate closely with software developers and operations teams to ensure high reliability, scalability, and efficiency of our systems, with a strong focus on meeting and exceeding customer expectations. Your expertise will be crucial in deploying, maintaining, and automating our infrastructure and application environments to ensure seamless user experiences. Your proactive involvement will be key to enhancing system reliability, optimizing resource utilization, and ensuring continuous improvement in our operational practices. Your responsibilities will include defining and tracking Service Level Objectives (SLOs), managing error budgets, and reducing toil through automation. 
You will play a pivotal role in driving the success of technology initiatives, maximizing their impact across the organization, and ensuring that solutions consistently meet the high standards our customers expect. Responsibilities Collaborate with development, security, quality, and operation teams to implement SRE practices and ensure system reliability Define and support required level of reliability, availability, and performance for services and applications Design and deliver Cloud-based solutions tailored to client needs Troubleshoot, mitigate, and support fixing of the infrastructure and application issues in a timely manner Implement a monitoring system for the infrastructure and application reliability Communicate technical concepts clearly to both engineering teams and management stakeholders Requirements Bachelor’s degree in Computer Science, Engineering, or a related field 3+ years of hands-on experience in Site Reliability Engineering or related roles Proven experience in any cloud (AWS/GCP/Azure) Experience with implementing SRE practices such as SLO/SLI, Error budgets, Postmortems, Reducing Toil, capacity planning, and Incident Management Python or other scripting/programming language Strong background in monitoring tools Proficiency in CI/CD tools, infrastructure as code, and configuration management Solid knowledge of container orchestration technologies (Kubernetes, Docker) English language proficiency at an Upper-Intermediate level (B2) or higher Nice to have Expertise in deployment and management of LLMs, including technologies like RAG Certification in Kubernetes, AWS/GCP/Azure, or similar technologies Proven experience in DevOps Knowledge of managing and optimizing AI/ML models in production environments, including basic deployment, monitoring, and maintenance We offer/Benefits We gather like-minded people: Engineering community of industry professionals Friendly team and enjoyable working environment Flexible schedule and opportunity to work 
remotely within Poland Chance to work abroad for up to 60 days annually Business-driven relocation opportunities We provide growth opportunities: Outstanding career roadmap Leadership development, career advising, soft skills, and well-being programs Certification (GCP, Azure, AWS) Unlimited access to LinkedIn Learning, Get Abstract, Cloud Guru English classes We cover it all: Stable income (Employment Contract or B2B) Participation in the Employee Stock Purchase Plan Benefits package (health insurance, multisport, shopping vouchers) Strategically located offices featuring entertainment and relaxation zones, table tennis and football, free snacks, fantastic coffee, and more Referral bonuses Corporate, social and well-being events Please, note: The set of bonuses might vary based on the role you apply for – specifics will be discussed with our recruiter during the general interview. We will reach out to selected candidates exclusively. EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.

Technology

Antal Sp. z o.o.

Cloud & Platform Engineer (m/f)

Senior

Hybrid

Opole, Poland

🏢 Summary: Cloud & Platform Reliability Engineer responsible for designing, building and maintaining highly available cloud-native infrastructure with a focus on reliability, security and observability. The role combines Platform Engineering, SRE and DevOps practices, supporting scalable Kubernetes environments in public cloud. You will drive automation, CI/CD, IaC and monitoring standards while ensuring system performance and resilience. 🗂️ Requirements: Minimum 3–4 years experience with cloud-native solutions, Hands-on experience with Docker and Kubernetes, Experience with public cloud (preferably Azure / AKS), Strong knowledge of CI/CD tools, Experience with observability tools (monitoring, logging, tracing), Experience with Infrastructure as Code tools, Knowledge of networking, security and troubleshooting, Automation skills using Python, Bash or PowerShell, English level B2 or higher 📃 Skills: Kubernetes, Docker, Azure, AKS, Terraform, Helm, Prometheus, Grafana, ELK, GitHub, CI/CD, Python, Bash, PowerShell, IaC, Monitoring, Logging, Tracing 🏢 Description: Dla naszego Klienta – międzynarodowej organizacji realizującej zaawansowane projekty transformacji cyfrowej w środowiskach produkcyjnych – poszukujemy doświadczonego Cloud & Platform Reliability Engineer (m/f). Osoba na tym stanowisku będzie odpowiedzialna za projektowanie, rozwój i utrzymanie nowoczesnej infrastruktury chmurowej wspierającej aplikacje o wysokiej dostępności, ze szczególnym naciskiem na niezawodność, bezpieczeństwo oraz obserwowalność systemów. Rola łączy elementy Platform Engineering, SRE oraz DevOps. 
Zakres obowiązków Projektowanie i wdrażanie rozwiązań z obszaru observability (monitoring, metryki, logowanie, distributed tracing) w środowiskach kontenerowych Budowa, konfiguracja i utrzymanie platform Kubernetes w chmurze publicznej, Automatyzacja procesów dostarczania oprogramowania poprzez rozwój i utrzymanie pipeline’ów CI/CD Tworzenie, utrzymanie i optymalizacja obrazów kontenerowych Docker Stosowanie podejścia Infrastructure as Code (IaC) Monitorowanie wydajności infrastruktury, identyfikacja oraz proaktywne rozwiązywanie problemów Współpraca z zespołami developerskimi w zakresie definiowania SLI/SLO oraz skutecznych mechanizmów alertowania Udział w analizie incydentów, zapewnienie ciągłości działania systemów oraz wdrażanie usprawnień zapobiegawczych Tworzenie i aktualizacja dokumentacji technicznej, procedur operacyjnych Wymagania Minimum 3/4 lata doświadczenia w pracy z rozwiązaniami cloud‑native Praktyczne doświadczenie z Dockerem i Kubernetesem Doświadczenie w pracy z chmurą publiczną (preferowane Azure / AKS) Bardzo dobra znajomość narzędzi CI/CD (np. GitHub Actions) Doświadczenie z narzędziami observability (np. Prometheus, Grafana, ELK) Znajomość narzędzi Infrastructure as Code (Terraform, Helm) Dobra znajomość zagadnień sieciowych, bezpieczeństwa oraz troubleshootingu Umiejętność automatyzacji z wykorzystaniem Python, Bash lub PowerShell Język angielski na poziomie komunikatywnym (min. B2) Oferta Praca przy nowoczesnych, skalowalnych platformach chmurowych Realny wpływ na architekturę i standardy technologiczne Stabilna współpraca w międzynarodowym środowisku Nastawienie na długoterminową współpracę Zatrudnienie na umowę o pracę bezpośrednio przez Klienta Antal Atrakcyjne wynagrodzenie oraz pakiet benefitów Lokalizacja: Opole (on-site/hybryd) Co zyskasz dzięki aplikacji na ofertę Antal? 
Gdy Twoja aplikacja zostanie rozpatrzona pozytywnie (zostaniesz zaproszony/a do procesu), otrzymasz wsparcie Konsultanta/Konsultantki, który/a utrzyma z Tobą stały kontakt (mailowo lub telefonicznie), pomoże Ci przygotować się do rozmowy rekrutacyjnej z przyszłym pracodawcą oraz zatroszczy się o jakość procesu rekrutacyjnego, w którym aktualnie bierzesz udział. Kim jesteśmy? Jesteśmy liderem rekrutacji specjalistów i menedżerów oraz doradztwa w obszarze HR. Marka obecna jest w 35 krajach, w Polsce działa od 1996 roku. Przez ten czas zbudowaliśmy wiele karier kandydatów, dzięki elastycznemu i kompleksowemu podejściu do wszystkich rekrutacji. Antal tworzy ponad 130 profesjonalnych konsultantów ds. rekrutacji, którzy są oni nie tylko skutecznymi rekruterami, ale także wykwalifikowanymi doradcami, specjalizującymi się zarówno w zakresie wybranego sektora, jak i stanowiska. Sprawdź inne aktualne oferty pracy na: https://antal.pl/dla-kandydata Zaobserwuj nasz profil na LinkedIn: https://www.linkedin.com/company/antalpoland

Technology

emagine Polska

Senior DevOps / SRE (Platform Reliability Engineer) - French fluent

Senior

Remote

Lisbon, Portugal

🏢 Summary: Senior DevOps / SRE role focused on ensuring reliability, scalability, security, and performance of a cloud-native AWS platform. The position centers on infrastructure automation, CI/CD, Kubernetes operations, observability, and implementing SRE best practices to support highly available production systems. You will lead incident management, optimize cloud costs, and drive continuous improvement of platform resilience. 🗂️ Requirements: 5+ years in DevOps/SRE/Cloud/Platform Engineering, Strong Linux administration and troubleshooting, Production experience with Kubernetes, Experience with CI/CD tools, Expertise in Infrastructure as Code, Hands-on experience with AWS, Strong networking fundamentals, Experience with monitoring and logging tools, Scripting skills (Bash or Python) 📃 Skills: AWS, Kubernetes, Docker, Helm, Terraform, Ansible, CloudFormation, Linux, GitLab, Jenkins, GitHub, Azure, Prometheus, Grafana, ELK, Datadog, Splunk, Bash, Python, TCP/IP, DNS 🏢 Description: We are looking for a Senior DevOps / Site Reliability Engineer (SRE) to ensure the reliability, scalability, performance, and security of our platform and cloud infrastructure. You will play a key role in building and operating cloud-native systems, improving observability, automating operations, implementing SRE best practices (SLOs/SLIs), and supporting development teams to deliver highly available services. Key Responsibilities Design, implement, and maintain highly available and scalable infrastructure on AWS. Own and improve the reliability of production systems using SRE principles (SLO, SLI, error budgets). Build and manage CI/CD pipelines to support fast and safe software delivery. Develop and maintain Infrastructure as Code (IaC) using Terraform, Ansible, CloudFormation, etc. Manage and optimize container orchestration platforms (Kubernetes, Docker, Helm). Implement and maintain monitoring, logging, and alerting solutions (Prometheus, Grafana, ELK, Datadog, Splunk). 
Lead incident response, perform root cause analysis, and write postmortems to drive continuous improvement. Improve system performance, capacity planning, scaling strategies, and disaster recovery processes. Collaborate closely with development teams to improve deployment strategies and system resilience. Implement security best practices (IAM, secret management, vulnerability scanning, patching). Define operational standards, runbooks, documentation, and best practices for platform reliability. Participate in on-call rotation and provide senior-level support for critical production issues. Key Responsibilities (5 Main Missions) The DevOps / SRE lead will be responsible for the stability and evolution of the platform. Your role is structured around five main areas: Mission 1: AWS Infrastructure Management (Build & Run) Mission 2: CI/CD and Deployment Automation Mission 3: Monitoring, Observability, and Alerting: Global Monitoring , Log Management , Application Monitoring , Business Analytics Mission 4: Incident Management, Resilience, and Security Mission 5: FinOps and AWS Cost Optimization Key Requirements 5+ years of experience in DevOps / SRE / Cloud Infrastructure / Platform Engineering. Strong expertise in Linux systems administration and troubleshooting. Proven experience with Kubernetes in production environments. Strong experience with CI/CD tools (GitLab CI, Jenkins, GitHub Actions, Azure DevOps). Solid knowledge of Infrastructure as Code (Terraform highly preferred). Experience with AWS cloud platforms. Strong understanding of networking fundamentals (TCP/IP, DNS, load balancing, reverse proxies). Experience with observability tools: monitoring, metrics, logging, tracing. Strong scripting skills (Bash, Python, or similar). French advanced level. Nice to Have Experience with additional cloud platforms (Azure, GCP). Strong understanding of networking fundamentals.

Technology

Link Group

Site Reliability Engineer

Mid

Hybrid

Warsaw, Poland

🏢 Summary: Hands-on Site Reliability Engineer role focused on building and scaling reliability practices across cloud and on-prem environments. The position involves improving performance, scalability, and resilience of production systems through automation, observability, and Kubernetes-based infrastructure. You will drive SRE standards and collaborate with engineering teams to enhance system stability and fault tolerance. 🗂️ Requirements: 4+ years experience in SRE, DevOps or similar roles, Strong experience with distributed systems, Strong experience with Kubernetes, Experience with AWS cloud, Hands-on automation experience with Python, Bash or Go, Solid understanding of CI/CD practices, Experience with observability and monitoring tools, Experience managing production systems 📃 Skills: Kubernetes, AWS, Python, Bash, Go, Prometheus, Grafana, CI/CD, SRE, DevOps 🏢 Description: We’re looking for a Site Reliability Engineer (SRE) to help build and scale reliability practices across our engineering organization. This is a hands-on role where you’ll work across cloud and on-prem environments, improving the performance, scalability, and resilience of critical production systems. 🔧 What you’ll be doing: • Driving SRE best practices, standards, and ways of working • Building and scaling observability & monitoring solutions (e.g. 
Prometheus, Grafana) • Working with Kubernetes-based infrastructure to ensure reliability and efficiency • Automating deployments, incident response, and recovery processes • Collaborating closely with engineering teams to improve system stability and fault tolerance • Contributing to a strong reliability culture (SLOs, post-mortems, continuous improvement) ✅ What we’re looking for: • 4+ years of experience in SRE / DevOps / similar roles • Strong experience with distributed systems, Kubernetes, and cloud (AWS preferred) • Hands-on approach to automation (Python, Bash, or Go) • Solid understanding of CI/CD and modern software delivery • Proactive mindset and strong ownership of production systems Name and surname*

Technology

Antal Sp. z o.o.

Senior Engineer - SRE Lead

Senior

Hybrid

Krakow, Poland

🏢 Summary: Leadership role responsible for managing and developing an SRE team supporting critical global IT services and business platforms. The position focuses on driving reliability, scalability, performance, and security through automation, observability, and engineering best practices. The role includes ownership of SRE processes, technical direction, and collaboration across product and architecture teams. 🗂️ Requirements: Minimum 3 years of experience managing engineering teams, Experience building or transforming high-performing SRE or Engineering teams, Strong knowledge of automation and scripting/programming, Experience with CI/CD processes, Practical experience with monitoring and observability tools, Experience maintaining high-availability, low-latency systems, Experience working in regulated environments, Experience working in Agile methodologies, Very good command of English 📃 Skills: Python, Go, Bash, CI/CD, Grafana, Splunk, AppDynamics, OpenTelemetry, SRE, Agile 🏢 Description: Team Lead Site Reliability Engineering (SRE) Kraków | Model hybrydowy O stanowisku Poszukujemy doświadczonego Team Leada Site Reliability Engineering, który obejmie odpowiedzialność za rozwój i prowadzenie zespołu SRE wspierającego krytyczne usługi IT oraz platformy biznesowe o globalnym zasięgu. Osoba na tym stanowisku będzie odpowiadać za zarządzanie całym cyklem realizacji zadań zespołu – od przyjmowania i priorytetyzacji zgłoszeń, przez planowanie i realizację prac, aż po raportowanie wyników. Będzie również wyznaczać kierunek techniczny, wdrażać najlepsze praktyki inżynierskie oraz budować kulturę ciągłego doskonalenia. Stanowisko wymaga ścisłej współpracy z architektami, Product Ownerami, zespołami produktowymi oraz operacyjnymi w celu zwiększania niezawodności, wydajności, skalowalności i bezpieczeństwa kluczowych usług IT. 
Zakres obowiązków Zarządzanie zespołem Site Reliability Engineering odpowiedzialnym za utrzymanie i rozwój krytycznych usług oraz platform wspieranych przez dostawców zewnętrznych. Odpowiedzialność za pełny proces realizacji prac zespołu: od przyjmowania zgłoszeń, przez ich priorytetyzację, po dostarczenie rezultatów i raportowanie. Współpraca z Product Ownerami przy definiowaniu i wdrażaniu wskaźników niezawodności (SLO, SLI, SLA). Wspieranie zespołów produktowych we wdrażaniu najlepszych praktyk SRE i niezawodności w całym cyklu życia oprogramowania. Rozwijanie obszaru monitoringu i obserwowalności systemów oraz wdrażanie narzędzi zwiększających niezawodność i efektywność operacyjną. Definiowanie oraz egzekwowanie standardów inżynierskich i operacyjnych, obejmujących dokumentację, code review, kontrolę jakości i procedury operacyjne. Mentoring, coaching i rozwój kompetencji członków zespołu. Budowanie kultury ciągłego doskonalenia oraz wysokiej jakości dostarczanych rozwiązań. Analiza wyników i wdrażanie usprawnień wpływających na efektywność zespołu i stabilność środowisk produkcyjnych. Wymagania Must have Minimum 3 lata doświadczenia w zarządzaniu zespołami inżynierskimi oraz wyznaczaniu kierunku technicznego w środowisku korporacyjnym. Udokumentowane doświadczenie w budowaniu lub transformacji zespołów w wysokowydajne organizacje SRE lub Engineering. Bardzo dobra znajomość automatyzacji oraz języków skryptowych/programistycznych (Python, Go, Bash lub podobnych). Doświadczenie z procesami CI/CD. Praktyczna znajomość narzędzi monitoringu i observability, takich jak Grafana, Splunk, AppDynamics, OpenTelemetry lub podobnych. Doświadczenie w utrzymaniu systemów o wysokiej dostępności i niskich opóźnieniach w środowiskach regulowanych (np. sektor finansowy, fintech, ubezpieczenia). Silne umiejętności analityczne i rozwiązywania problemów. Bardzo dobra znajomość języka angielskiego w mowie i piśmie. Doświadczenie w pracy zgodnie z metodykami Agile. 
Umiejętność samodzielnej pracy oraz efektywnej współpracy w międzynarodowym środowisku. Wysoko rozwinięte umiejętności komunikacyjne, dokumentacyjne oraz poczucie odpowiedzialności za dostarczane rozwiązania. Mile widziane Doświadczenie w utrzymaniu i rozwoju usług IT opartych o rozwiązania dostawców zewnętrznych. Znajomość zagadnień związanych z compliance, zarządzaniem ryzykiem oraz regulacjami obowiązującymi w sektorze usług finansowych. Co oferujemy Możliwość realnego wpływu na rozwój i strategię obszaru Site Reliability Engineering. Pracę przy krytycznych systemach o dużej skali i wysokich wymaganiach dotyczących dostępności. Współpracę z międzynarodowymi zespołami ekspertów. Środowisko nastawione na rozwój technologiczny, automatyzację i ciągłe doskonalenie. Atrakcyjne warunki zatrudnienia oraz możliwość rozwoju kariery w organizacji o globalnym zasięgu. Benefity: LuxMed, MyBenefit Co zyskasz dzięki aplikacji na ofertę Antal? Gdy Twoja aplikacja zostanie rozpatrzona pozytywnie (zostaniesz zaproszony/a do procesu), otrzymasz wsparcie Konsultanta/Konsultantki, który/a utrzyma z Tobą stały kontakt (mailowo lub telefonicznie), pomoże Ci przygotować się do rozmowy rekrutacyjnej z przyszłym pracodawcą oraz zatroszczy się o jakość procesu rekrutacyjnego, w którym aktualnie bierzesz udział. Kim jesteśmy? Jesteśmy liderem rekrutacji specjalistów i menedżerów oraz doradztwa w obszarze HR. Marka obecna jest w 35 krajach, w Polsce działa od 1996 roku. Przez ten czas zbudowaliśmy wiele karier kandydatów, dzięki elastycznemu i kompleksowemu podejściu do wszystkich rekrutacji. Antal tworzy ponad 130 profesjonalnych konsultantów ds. rekrutacji, którzy są oni nie tylko skutecznymi rekruterami, ale także wykwalifikowanymi doradcami, specjalizującymi się zarówno w zakresie wybranego sektora, jak i stanowiska. Sprawdź inne aktualne oferty pracy na: https://antal.pl/dla-kandydata Zaobserwuj nasz profil na LinkedIn: https://www.linkedin.com/company/antalpoland

Technology

Antal Sp. z o.o.

Linux Site Reliability Engineer

Senior

Hybrid

Krakow, Poland

🏢 Summary: Hybrid Linux Site Reliability Engineer role focused on developing, maintaining, and securing infrastructure supporting global cybersecurity services across on-premise and cloud environments. The position involves automation, incident management, and ensuring high availability of Linux-based and containerized systems within an Agile environment. The engineer will work with DevSecOps tools and participate in on-call rotations. 🗂️ Requirements: Minimum 5 years experience as DevOps / DevSecOps / SRE, Strong knowledge of Linux (RHEL), Experience with Bash and Python, Experience with Ansible (automation, playbooks), Experience in server and application management, Knowledge of IP networks (routing, firewall, troubleshooting), Knowledge of Incident and Change Management processes, Experience with CI/CD tools, Basic knowledge of PostgreSQL, Ability to work in Agile environment, Willingness to participate in on-call rotation 📃 Skills: Linux, RHEL, Bash, Python, Ansible, Kubernetes, Docker, GCP, Terraform, Vault, Git, Jenkins, GitHub, JIRA, PostgreSQL, Splunk, AppDynamics, Tenable, Nessus, CI/CD, IP, Firewall 🏢 Description: Linux Site Reliability Engineer Tryb pracy: hybrydowy - 6dni/miesiąc w biurze Klienta w Krakowie O roli Dołącz do zespołu naszego Klienta, pracującego nad nowoczesnymi rozwiązaniami w obszarze cyberbezpieczeństwa w skali globalnej. Poszukujemy doświadczonego Site Reliability Engineera, który będzie odpowiedzialny za rozwój, utrzymanie i zabezpieczenie infrastruktury wspierającej zaawansowane usługi bezpieczeństwa IT. Będziesz pracować w środowisku Agile, współtworząc rozwiązania, które zapewniają bezpieczeństwo systemów, aplikacji i danych w środowiskach on-premise oraz chmurowych. 

Responsibilities

Maintaining and developing the infrastructure that supports cybersecurity tools
Supporting the production environment (incident management, troubleshooting, user support)
Automating processes and developing DevSecOps tools
Managing environments based on Linux (RHEL), Kubernetes, and cloud solutions
Monitoring systems and ensuring their high availability
Managing vulnerabilities and rolling out security patches
Creating and maintaining technical documentation
Collaborating with IT and security teams at a global level
Participating in on-call duty (on rotation)

Requirements

At least 5 years of experience in a DevOps / DevSecOps / SRE role
Very good knowledge of Linux systems (RHEL)
Experience with Bash and Python
Experience with Ansible (automation, playbooks)
Experience in managing servers and applications
Experience with IP networks (routing, firewall, troubleshooting)
Knowledge of Incident & Change Management processes
Experience with CI/CD (e.g. GitHub, Jenkins, JIRA)
Basic knowledge of databases (PostgreSQL)
Ability to work in an Agile environment

Nice to have

Experience with the cloud (especially GCP)
Knowledge of Docker / Kubernetes
Tools: Terraform, HashiCorp Vault, Git
Monitoring solutions (e.g. Splunk, AppDynamics)
Security tools (e.g. Tenable, Nessus)
Experience working with large, distributed infrastructure

We offer

Participation in global cybersecurity projects
The opportunity to work with a modern technology stack
Flexible working model (hybrid / remote)
Competitive remuneration and a benefits package
Access to training and professional development programmes
Real influence on the development of solutions and architecture
Benefits: LuxMed medical care, MyBenefit cafeteria

What do you gain by applying for an Antal job offer?

Once your application is reviewed positively (you are invited to the process), you will be supported by a Consultant who will stay in regular contact with you (by email or phone), help you prepare for the interview with your future employer, and look after the quality of the recruitment process you are taking part in.

Who are we?

We are a leader in the recruitment of specialists and managers and in HR advisory. The brand is present in 35 countries and has operated in Poland since 1996. Over that time we have built many candidates' careers thanks to a flexible and comprehensive approach to every recruitment. Antal is made up of more than 130 professional recruitment consultants who are not only effective recruiters but also qualified advisors, specialising both in a chosen sector and in a given position.

Technology

Link Group

DevOps / Site Reliability Engineer

Mid

Hybrid

Kraków, Poland

20,000 - 25,000 PLN

🏢 Summary: DevOps / Site Reliability Engineer role focused on building and maintaining scalable cloud infrastructure while improving platform reliability and automation. The position centers on Kubernetes-based environments, CI/CD pipeline development, and enhancing monitoring and observability. The engineer will support development teams through infrastructure as code and internal developer platform initiatives.

🗂️ Requirements:
Experience with cloud platforms (Azure preferred)
Strong experience with Kubernetes
Strong knowledge of Infrastructure as Code (Terraform)
Hands-on experience with CI/CD tools
Experience with monitoring and observability tools
Understanding of scalability, reliability, and security best practices

📃 Skills: Azure, Kubernetes, Terraform, GitHub Actions, ArgoCD, CI/CD, Datadog, Prometheus, Grafana, MongoDB, Rancher, Jenkins, Power BI, Jira, Confluence

🏢 Description:

DevOps / Site Reliability Engineer

We’re looking for a DevOps / SRE to help build and maintain scalable cloud infrastructure and improve reliability across our platform. You’ll focus on automation, CI/CD, and supporting development teams with efficient tooling and processes.

Key responsibilities

Develop and manage cloud infrastructure (Azure preferred)
Work with Kubernetes and containerized environments
Build and maintain CI/CD pipelines (GitHub Actions, ArgoCD)
Automate deployments and operational processes
Contribute to Internal Developer Platform (IDP) development
Improve monitoring and observability (e.g., Datadog, Prometheus, Grafana)

Requirements

Experience with cloud platforms and Kubernetes
Strong knowledge of Infrastructure as Code (e.g., Terraform)
Hands-on experience with CI/CD tools
Understanding of scalability, reliability, and security best practices
Experience with monitoring/observability tools

Nice to have

Experience with MongoDB Atlas, Rancher, Jenkins, Power BI
Familiarity with Jira, Confluence

Technology

Yard Corporate

Site Reliability Engineer (SRE)

Senior

Hybrid

Warsaw, Poland

40,000 - 55,000 PLN

🏢 Summary: Senior Site Reliability Engineer role focused on building and standardizing SRE practices across a hybrid AWS and on-prem infrastructure. The position centers on ensuring scalability, resilience, and high availability of high-frequency, data-intensive platforms through observability, automation, and Kubernetes optimization. You will define SLOs, enhance monitoring architecture, and drive reliability culture across engineering teams.

🗂️ Requirements:
5+ years experience in SRE, DevOps, or Infrastructure Engineering supporting distributed production systems
Bachelor’s degree in Computer Science, Computer Engineering, or related field (or equivalent experience)
Deep expertise in Grafana, Prometheus, Loki, and Tempo (OpenTelemetry)
Strong production experience with Docker and Kubernetes
Experience managing hybrid infrastructure (AWS and on-premises)
Proficiency in at least one language: Python, Go, or Bash
Hands-on experience with CI/CD pipelines and Infrastructure-as-Code
Experience defining and managing SLOs and SLAs
Willingness to participate in on-call rotation

📃 Skills: AWS, Kubernetes, Docker, Prometheus, Grafana, Loki, Tempo, OpenTelemetry, Python, Go, Bash, CI/CD, IaC, Git, Hypervisors

🏢 Description:

About the Client

Our client is a premier, global investment management firm operating at the intersection of finance and technology. Known for their sophisticated, data-intensive systems, they build and maintain high-performance platforms that process massive volumes of market and operational data. To support their expanding footprint, they are looking for a senior-level Site Reliability Engineer (SRE) who will take ownership of shaping, standardizing, and scaling their SRE frameworks and reliability culture from the ground up.

The Role

In this role, you will serve as a foundational force for SRE practices, partnering directly with Cloud, Infrastructure, and Software Engineering squads. You will work across a hybrid infrastructure (combining advanced AWS cloud environments and physical on-premises servers) to guarantee the scalability, resilience, and maximum uptime of critical, high-frequency transactional platforms.

Core Responsibilities

SRE Evangelism: Design, implement, and champion core reliability principles, helping technology teams adopt sustainable scaling practices.
Observability Architecture: Implement, scale, and maintain end-to-end monitoring, telemetry, and distributed tracing systems utilizing Prometheus, Grafana, Loki, and Tempo (OpenTelemetry framework).
Kubernetes Optimization: Establish best-practice configurations for containerized workloads, ensuring applications running on Kubernetes are highly resilient, cost-effective, and performant.
Incident Management & Culture: Participate in a balanced, shared on-call rotation (averaging one week per month).
Automation & Engineering: Build custom tooling and CI/CD pipelines to automate routine tasks, system health checks, and rapid disaster recovery workflows.
SLO/SLA Definition: Partner with product and engineering teams to define, monitor, and enforce Service Level Objectives (SLOs) and Error Budgets.

What We Look For

Experience: 5+ years of hands-on experience in a dedicated SRE, DevOps, or Infrastructure Engineering role supporting complex, distributed production systems.
Education: A Bachelor’s degree in Computer Science, Computer Engineering, or a related technical discipline (or equivalent practical experience).
Observability Expertise: Deep, subject-matter knowledge of modern monitoring stacks, specifically Grafana, Prometheus, Loki, and Tempo (OTel).
Orchestration & Containers: Strong, production-grade expertise in containerization (Docker) and orchestration (Kubernetes).
Hybrid Infrastructure: Experience navigating hybrid models, managing both cloud services (AWS preferred) and physical on-premise hardware resources.
Scripting/Coding: Proficiency in writing clean, maintainable code in at least one scripting or programming language (e.g., Python, Bash, or Go) to build reliable automation.
Methodologies: Solid grounding in CI/CD concepts, infrastructure-as-code (IaC), and agile development processes.
Soft Skills: Excellent verbal and written communication skills, with a proven ability to convey complex infrastructure and reliability concepts to both technical and non-technical stakeholders.

What We Offer

Stable Employment: Full-time employment contract (Umowa o Pracę - UoP).
Tax Optimization: Eligibility for creative tax-deductible costs (KUP - Koszty Uzyskania Przychodu).
Financial Reward: Highly competitive base salary accompanied by a generous annual performance bonus.
Comprehensive Health: Premium private medical care package that fully includes dental coverage (stomatologia).
Wellness & Lifestyle: MultiSport card to keep you active and healthy.
Daily Perks: Pre-funded lunch card for your daily meals.

Tech Stack at a Glance

Cloud & Virtualization: AWS, Kubernetes, Docker, On-Premises Hypervisors
Observability: Prometheus, Grafana, Loki, Tempo, OpenTelemetry (OTel)
Languages: Python, Go, Bash
CI/CD & Automation: Git-based pipelines, Configuration Management, IaC

Technology

Antal Sp. z o.o.

Site Reliability Engineer

Senior

Hybrid

Krakow, Poland

180 - 220 PLN

🏢 Summary: Site Reliability Engineer role focused on supporting and enhancing a next-generation Counterparty Credit Risk platform built on microservices and hybrid cloud infrastructure. The position involves ensuring reliability, scalability and performance of distributed systems while contributing to cloud migration and DevOps transformation. The engineer will manage production incidents, improve observability and support CI/CD and automation initiatives.

🗂️ Requirements:
4+ years experience with distributed systems in Java environments
Experience in application support and production incident management
Strong troubleshooting and analytical skills
Experience with CI/CD tools (Jenkins, Ansible, JIRA, Confluence)
Knowledge of monitoring and logging tools (Grafana, Prometheus, InfluxDB, Splunk or similar)
Basic knowledge of cloud platforms (GCP preferred)
Knowledge of relational databases (Oracle, PostgreSQL)
Experience with disaster recovery processes
Familiarity with Unix/Linux environments
Experience working with cross-platform systems (Java/Python)

📃 Skills: Java, Spring, GCP, Redis, REST, Ansible, Jenkins, Grafana, Prometheus, InfluxDB, Splunk, Loki, Oracle, PostgreSQL, Linux, Apache, Beam, Flink, JIRA, Confluence

🏢 Description:

Site Reliability Engineer

Department: Market Securities & Services
Hybrid working model (2 days per week in Kraków office)

For our Client, a leading international financial institution and one of the largest investment banks globally, we are looking for a Site Reliability Engineer to join the Market Securities & Services IT division. The role sits within the Counterparty Credit Risk (CCR) Technology team, responsible for delivering critical risk calculation platforms used globally. The team is currently building the next generation of Counterparty Credit Risk Engines, including cloud migration and development of in-house analytical libraries to replace vendor solutions. This is a unique opportunity to join a growing engineering team in Kraków and contribute to a strategic, multi-year transformation programme.

About the Team & Technology Landscape

The new CCR platform is based on microservices architecture and leverages modern open-source technologies. It runs across Google Cloud Platform and on-premise infrastructure. Technologies include: Java SE, Spring Boot, Spring Cloud, Apache Beam, Apache Flink, GCP, Redis, REST APIs, Ansible, Jenkins. The organisation is heavily investing in Agile ways of working, DevOps practices, CI/CD pipelines, and Cloud technologies.

Your Responsibilities

Manage application support operations with focus on resiliency, availability and performance
Coordinate production incident resolution and conduct post-mortems / root cause analysis
Investigate and resolve complex production issues across distributed systems
Contribute to continuous service improvement and knowledge base documentation
Actively engage in Incident, Problem and Service Management processes
Apply SRE principles to enhance reliability, scalability and observability
Develop and improve monitoring, alerting and incident detection mechanisms
Support hybrid cloud environments and automation initiatives
Work in a 2-shift rotation (8:00 AM start / 4:00 PM start)
Participate in weekend and on-call rotations

What We Are Looking For

4+ years of experience supporting and/or developing distributed systems (Java-based environments)
Strong troubleshooting and analytical skills
Experience with disaster recovery processes
Hands-on experience with application lifecycle and CI/CD tooling (JIRA, Confluence, Jenkins, Ansible)
Experience supporting complex, cross-platform systems (Java / Python environments)
Knowledge of Agile/Kanban delivery models
Experience implementing monitoring and logging frameworks (e.g. Grafana, InfluxDB, Prometheus, Splunk, Loki or similar)
Basic knowledge of relational databases (Oracle, PostgreSQL)
Understanding of cloud platforms (preferably GCP)
Familiarity with Unix/Linux environments
Ability to lead technical discussions with global support teams
Strong communication skills and ability to work across regions

Technical Requirements

Core Java knowledge
Application support experience
Monitoring tools (Grafana, InfluxDB, Prometheus or similar)
Basic cloud knowledge (GCP preferred)
Automation tools (Jenkins, Ansible)
Knowledge of relational databases (Oracle, PostgreSQL)

Why Apply?

Opportunity to work on business-critical global risk platforms
Participation in a large-scale cloud and architecture transformation
Modern technology stack and DevOps culture
Hybrid working model (2 days per week in Kraków office)
Long-term project within a stable, global financial environment

Why apply for an Antal job offer?

When your application is successful, you will be supported by a dedicated Consultant who will stay in regular contact with you (via email or phone), help you prepare for interviews with your future employer, and ensure a smooth and professional recruitment process.

About Antal

Antal is a leading recruitment and HR advisory company, present in Poland since 1996 and later expanded to the Czech Republic and Hungary. Across the CEE region, we employ around 150 professionals who deliver a full range of services, from specialist and executive recruitment, employee outsourcing and HR consulting, to employer branding and market research. Our division-based structure combines deep industry expertise with functional specialisation, enabling us to provide tailored solutions for companies in every sector. We act as a trusted partner for both employers and candidates, sharing our knowledge and guiding them through every stage of the talent journey. We connect exceptional people with the right opportunities and help organisations build successful teams.

Antal Sp. z o.o.

Antal Sp. z o.o. is a leading company in the recruitment and HR consulting industry. Established in Poland in 1996, Antal has expanded its presence to 35 countries, demonstrating a strong international footprint. The company is renowned for its flexible and comprehensive approach to recruitment, having successfully built numerous careers over the years. Antal employs over 130 professional recruitment consultants who are not only effective recruiters but also skilled advisors specializing in specific sectors and positions. The company's mission revolves around providing high-quality recruitment services and HR advice, ensuring a seamless and effective hiring process for both candidates and employers.
