April 29, 2026

Senior Site Reliability Engineer - Azure

Senior • Remote

About Hashgraph:

Hashgraph is a fast-growing software company committed to supporting, developing and servicing Hedera, an open-source, proof-of-stake platform. Hedera is EVM-compatible and has been specifically built to meet the needs of enterprise and web3 applications, which require speed, security, stability and sustainability. Hedera's public network is governed by industry-leading organizations, spanning 11 sectors and 14 regions, who oversee the development and direction of the decentralized platform.

The role:

We are hiring a Senior Site Reliability Engineer (Azure) to build and scale the Azure infrastructure foundation for HashSphere, a new private DLT network that harnesses Hedera's institutional-grade technology and is being built by a passionate team of industry leaders. This role exists to ensure that our platform can operate as a secure, scalable, and production-ready system in Azure, supporting complex enterprise use cases and high reliability expectations.

The impact you'll have:

In this role, you will:

  • Design and build secure, scalable Azure infrastructure from first principles for a production-grade distributed system
  • Develop and own Terraform-based infrastructure as code, enabling repeatable and automated deployments
  • Translate product and customer requirements into technical architecture and execution plans
  • Build and enhance platform services, APIs, and integrations that extend HashSphere capabilities
  • Partner across engineering, security, and product teams to deliver enterprise-ready infrastructure solutions
  • Contribute to operational excellence, including reliability, observability, and incident response
  • Support customer deployments and production environments through Tier 2 infrastructure support

What success looks like in 6-12 months:

  • Azure is a production-ready deployment environment for HashSphere
  • Customer deployments are repeatable, scalable, and secure
  • Azure achieves feature parity with other supported cloud environments

What you bring:

Core capabilities:

  • Proven experience designing and building production-grade systems on Azure
  • Ability to take ambiguous requirements through structured technical solutions to delivered systems
  • Strong technical communication skills across engineering and non-technical stakeholders
  • High ownership mindset with a bias for action and accountability
  • Collaborative approach with a focus on building durable, scalable solutions

Functional expertise:

  • Azure cloud services (networking, compute, identity, security, storage)
  • Terraform (infrastructure as code at production scale)
  • Programming experience in Go and/or Python
  • Experience building greenfield infrastructure environments
  • Distributed systems, high-availability architectures, or platform engineering
  • CI/CD and automation tooling for infrastructure lifecycle management

Nice to haves:

  • Kubernetes and container orchestration
  • Observability tooling (Prometheus, Grafana)
  • Workflow/orchestration platforms (Argo, Spacelift, or similar)

Similar jobs you might like

Technology

Hashgraph

Senior Site Reliability Engineer - Azure

Senior

On-site

San Francisco, CA

🏢 Summary: Senior Site Reliability Engineer (Azure) role focused on building and scaling secure, production-grade Azure infrastructure for a private distributed ledger network. The position centers on designing infrastructure from scratch, implementing infrastructure as code, and ensuring high availability, reliability, and enterprise readiness. The engineer will support complex deployments and operational excellence in a distributed systems environment.

🗂️ Requirements:

  • Proven experience designing and building production-grade systems on Azure
  • Strong expertise in Azure cloud services: networking, compute, identity, security, storage
  • Hands-on experience with Terraform for infrastructure as code at production scale
  • Programming experience in Go or Python
  • Experience with distributed systems and high-availability architectures
  • Experience with CI/CD and infrastructure automation tooling
  • Ability to design greenfield infrastructure environments

📃 Skills: Azure, Terraform, Go, Python, CI/CD, Kubernetes, Prometheus, Grafana, Argo, Spacelift

Technology

Hashgraph

Senior DevOps Engineer

Senior

On-site

Seattle, WA

🏢 Summary: Senior DevOps Engineer (Node Operations) responsible for ensuring reliability, security, and automation of Hedera consensus node environments across testnet and preproduction networks. The role focuses on infrastructure automation, Infrastructure-as-Code on GCP, and improving release and operational processes for a globally distributed system. It includes operating production-grade distributed systems and driving incident response and reliability improvements.

🗂️ Requirements:

  • Strong Linux administration in production environments
  • Advanced networking troubleshooting skills
  • Experience with Terraform for Infrastructure-as-Code
  • Experience with Ansible for configuration management
  • Experience with CI/CD pipeline automation (e.g., Jenkins)
  • Experience operating distributed systems at scale
  • Hands-on experience with GCP infrastructure
  • Knowledge of Kubernetes fundamentals
  • Experience in incident response and RCA
  • Ability to automate operational workflows

📃 Skills: Terraform, GCP, Ansible, Jenkins, Linux, Networking, Kubernetes, CICD, IaC, DistributedSystems, IncidentResponse, RCA, Automation

🏢 Description:

About Hashgraph:

Hashgraph is a fast-growing software company committed to supporting, developing and servicing Hedera, an open source, proof-of-stake platform. Hedera is EVM-compatible and has been specifically built to meet the needs of enterprise and web3 applications, which require speed, security, stability and sustainability. Hedera's public network is governed by industry-leading organizations, spanning 11 sectors and 14 regions who oversee the development and direction of the decentralized platform.

The role:

We are hiring a Senior DevOps Engineer (Node Operations) to ensure the reliability, security, and operational excellence of Hedera consensus node environments. This role exists to reduce operational toil, strengthen infrastructure automation, and improve release and preproduction readiness across a globally distributed network. Without this role, we risk increased availability incidents, slower recovery times, and delays in delivering against the product roadmap.

In this role, you will:

  • Operate and improve Hedera consensus node environments across testnet, previewnet, and preproduction
  • Design and implement automation-first workflows for release and preproduction environments
  • Build and maintain Infrastructure-as-Code (Terraform) on GCP
  • Improve change management, release safety, and operational predictability
  • Participate in on-call rotation, incident response, and RCA, driving corrective actions into automation
  • Partner with internal engineering teams and external stakeholders, including Hedera Governing Council members, to support operational requirements

What success looks like in 6-12 months:

  • Operational toil is significantly reduced through durable automation and standardization
  • Node environments are more reliable, with fewer incidents and faster recovery times
  • Release and preproduction workflows are predictable, repeatable, and automated
  • Infrastructure changes are consistent, testable, and auditable through IaC best practices

What you bring:

Core capabilities:

  • Strong systems reliability mindset with experience in incident response and RCA
  • Proven ability to automate operational workflows and reduce manual toil
  • Clear communicator with the ability to work across engineering, security, and external partners
  • Deep ownership mentality with a bias toward preventative engineering over reactive fixes
  • Strong Linux and networking troubleshooting in production environments

Functional expertise:

  • Infrastructure-as-Code with Terraform (module design, state management)
  • Configuration management with Ansible
  • CI/CD automation (Jenkins or equivalent pipeline tooling)
  • Experience operating distributed systems or production infrastructure at scale
  • Familiarity with Kubernetes fundamentals

Nice to haves:

  • Observability stacks (e.g., Grafana, Loki, Tempo, Mimir)
  • Programming/scripting (Go, Python, Bash)
  • GitHub/GitHub Actions experience

Technology

Hashgraph

Staff Software Engineer - CI/CD & Release Engineering

Senior

On-site

New York City, NY

🏢 Summary: Staff Software Engineer role focused on architecting and scaling CI/CD and release engineering systems to support complex, multi-product software delivery. The position centers on building reliable, automated, and observable release infrastructure using cloud-native and Kubernetes-based environments. It aims to improve deployment speed, safety, and operational excellence across internal and open-source platforms.

🗂️ Requirements:

  • 7+ years building and operating production-grade CI/CD or engineering infrastructure
  • Deep experience with CI/CD systems (GitHub Actions or GitLab CI)
  • Strong experience with Kubernetes and Docker
  • Strong programming skills in Kotlin, Java, Go, Python, or Bash
  • Experience building automated build, test, release, and deployment pipelines
  • Experience with Gradle, Maven, or NPM
  • Experience operating infrastructure in AWS or Google Cloud Platform
  • Experience implementing observability and monitoring solutions
  • Strong understanding of release management and deployment reliability practices

📃 Skills: Kubernetes, Docker, GitHubActions, GitLabCI, Kotlin, Java, Go, Python, Bash, Gradle, Maven, NPM, AWS, GCP, Grafana

🏢 Description:

About Hashgraph:

Hashgraph is a fast-growing software company committed to supporting, developing and servicing Hedera, an open source, proof-of-stake platform. Hedera is EVM-compatible and has been specifically built to meet the needs of enterprise and web3 applications, which require speed, security, stability and sustainability. Hedera's public network is governed by industry-leading organizations, spanning 11 sectors and 14 regions who oversee the development and direction of the decentralized platform.

The role:

We are hiring a Staff Software Engineer - CI/CD & Release Engineering to architect and scale the systems that power software delivery across Hashgraph's internal and open-source platforms. This role exists because our engineering organization is operating increasingly complex distributed systems across multiple products, environments, and release streams. We need a senior technical builder who can design reliable, scalable, and developer-friendly release infrastructure that enables engineering teams to ship quickly, safely, and repeatedly at production scale. You will help establish the foundation for how software is built, tested, released, deployed, and observed across the company, improving engineering velocity while strengthening reliability and operational confidence.

In this role, you will:

  • Architect and evolve scalable CI/CD systems that support complex multi-product release workflows across internal and open-source platforms
  • Design and build developer tooling, deployment automation, and release orchestration systems using technologies such as GitHub Actions, Kubernetes, and cloud-native infrastructure
  • Lead the engineering strategy for build pipelines, artifact management, release governance, and deployment reliability
  • Improve developer experience by reducing friction in build, test, deployment, and operational workflows
  • Build and maintain highly reliable Kubernetes-based infrastructure powering release automation and engineering productivity systems
  • Drive observability and operational excellence across release systems through metrics, monitoring, and performance instrumentation
  • Partner closely with Platform Engineering, DevOps, Security, Program Management, and Product Engineering teams to align release infrastructure with business and technical priorities
  • Serve as a senior technical leader and multiplier within the organization through architecture guidance, mentorship, and operational rigor

What success looks like in 6-12 months:

  • CI/CD systems are highly reliable, scalable, and capable of supporting rapid multi-product releases with minimal operational overhead
  • Engineering teams ship software faster with improved deployment confidence and reduced release friction
  • Build and release infrastructure is standardized, observable, and repeatable across products and environments
  • Deployment automation significantly reduces manual operational effort and release risk
  • Release engineering becomes a strategic enabler for engineering throughput, platform reliability, and product delivery velocity

What you bring:

Core capabilities:

  • Strong systems thinking with the ability to design scalable engineering infrastructure from first principles
  • Deep ownership mentality with a bias toward automation, reliability, and operational excellence
  • Strong communication and cross-functional collaboration skills across engineering and business stakeholders
  • Ability to balance strategic architectural thinking with hands-on execution
  • Proven ability to operate effectively in fast-moving, high-autonomy engineering environments

Functional expertise:

  • 7+ years building and operating production-grade software engineering infrastructure or CI/CD platforms
  • Deep experience designing and operating CI/CD systems using tools such as GitHub Actions and/or GitLab CI
  • Strong experience with Kubernetes, Docker, and cloud-native infrastructure patterns
  • Strong programming and automation experience in one or more of: Kotlin, Java, Go, Python, or Bash
  • Experience building scalable build, test, release, and deployment automation systems
  • Experience with modern build systems and dependency management tools such as Gradle, Maven, or NPM
  • Experience operating engineering infrastructure within AWS and/or Google Cloud Platform environments
  • Experience designing observability and monitoring solutions using tools such as Grafana and related telemetry systems
  • Strong understanding of release management strategy, deployment safety, and operational reliability practices
  • Experience writing high-quality technical documentation, engineering standards, and operational processes

Nice to haves:

  • Experience supporting large-scale open-source software development workflows
  • Experience with distributed systems, blockchain, or decentralized infrastructure technologies
  • Experience optimizing build performance, release throughput, or engineering productivity at scale
  • Expertise in Golang and advanced shell automation

Technology

Cyclad

Senior Site Reliability Engineer (Azure & OpenShift)

Senior

Hybrid

Warsaw, Poland

130 - 140 PLN

🏢 Summary: Senior Site Reliability Engineer role focused on operating and optimizing Azure and OpenShift environments, ensuring reliability, scalability, and automation across staging and production systems. The position involves managing CI/CD, implementing Infrastructure as Code, enhancing observability, and leading incident management in a hybrid cloud setup. This is a hybrid role in Warsaw with collaboration across engineering teams to embed SRE and DevOps best practices.

🗂️ Requirements:

  • 3+ years in SRE, DevOps, or Platform Engineering
  • Strong production experience with Microsoft Azure
  • Strong experience with OpenShift and Kubernetes
  • Experience with CI/CD pipelines and GitOps
  • Hands-on experience with Terraform
  • Experience with monitoring and logging tools
  • Scripting skills (Bash or similar)
  • Knowledge of observability and reliability engineering principles
  • Experience in incident management and root cause analysis
  • Focus on automation and system stability

📃 Skills: Azure, OpenShift, Kubernetes, Terraform, ArgoCD, Jenkins, GitHubActions, Bash, Prometheus, Grafana, ELK, EFK, AzureMonitor, GitOps, CI/CD

🏢 Description:

At Cyclad, we work with top international IT companies to boost their potential in delivering outstanding, cutting-edge technologies that shape the world of the future. Currently, we are looking for an experienced Senior Site Reliability Engineer (Azure & OpenShift) to join our team.

Project information:

  • Location: Warsaw (hybrid)
  • Type of employment: B2B contract or standard employment contract
  • Project languages: English

Key Responsibilities:

  • Own and operate staging and production environments in Microsoft Azure
  • Manage and support application deployments on OpenShift (on-prem and Azure)
  • Support and optimize CI/CD pipelines and enable GitOps practices (e.g., ArgoCD)
  • Ensure system reliability through SLIs, SLOs, and continuous improvement of service health
  • Design, implement, and maintain observability solutions (monitoring, logging, alerting) using tools such as Prometheus, Grafana, Azure Monitor, and ELK/EFK
  • Troubleshoot issues across infrastructure, platform (Azure/OpenShift), applications, and deployments
  • Lead incident management, including root cause analysis (RCA), MTTR reduction, and prevention of recurring issues
  • Build and maintain Infrastructure as Code using Terraform and drive automation to reduce operational toil
  • Improve deployment reliability, release processes, and overall system resilience
  • Collaborate with development teams to embed reliability into design, delivery, and operational practices
  • Maintain and improve operational documentation, including runbooks and procedures
  • Ensure performance, scalability, cost efficiency, security, and compliance of cloud infrastructure
  • Advocate for SRE best practices and a DevOps culture across engineering teams

Requirements:

  • 3+ years of experience in SRE, DevOps, or Platform Engineering roles
  • Strong experience with Microsoft Azure in production environments
  • Strong experience with OpenShift Container Platform (OCP) and Kubernetes
  • Experience with CI/CD pipelines (e.g., ArgoCD, Jenkins, GitHub Actions) and container-based deployments
  • Strong understanding of observability, incident management, and reliability engineering principles
  • Hands-on experience with Infrastructure as Code (Terraform)
  • Scripting experience (Bash or similar)
  • Experience with monitoring and logging tools (Prometheus, Grafana, ELK/EFK)
  • Strong focus on automation, system stability, and continuous improvement

We offer:

  • Private medical care with dental care (covering 70% of costs). Family package option possible
  • Multisport card (also for an accompanying person)
  • Life insurance
  • Work with talented engineers on large-scale, technically challenging projects

Technology

Link Group

Senior Azure DevOps Engineer

Senior

Remote

Bialystok, Poland

140 - 155 PLN

🏢 Summary: Design, deploy, and maintain high-availability Azure cloud environments with a strong focus on AKS and Infrastructure as Code using Terraform. The role centers on secure, scalable, and well-monitored Azure infrastructure, including networking, identity, databases, and disaster recovery. You will drive automation and operational excellence across the Azure ecosystem.

🗂️ Requirements:

  • Experience managing Azure Kubernetes Service (AKS) clusters
  • Proficiency in Infrastructure as Code using Terraform
  • Experience with YAML and Helm for Kubernetes deployments
  • Administration of Azure networking components (VNETs, NSGs)
  • Management of Azure VMs, Storage Accounts, and ACR
  • Implementation of Azure AD (Entra ID) and IAM policies
  • Administration of Azure SQL and SQL Server environments
  • Configuration of monitoring with Azure Monitor, App Insights, and Log Analytics
  • Design and implementation of disaster recovery and backup strategies

📃 Skills: Azure, AKS, Kubernetes, Terraform, YAML, Helm, VNET, NSG, ACR, AzureAD, IAM, AzureSQL, SQLServer, AzureMonitor, AppInsights, LogAnalytics, Velero

🏢 Description:

Role Overview:

We are looking for a highly skilled Azure Cloud & Platform Engineer to join our infrastructure team. In this role, you will be responsible for designing, deploying, and maintaining high-availability cloud environments with a heavy focus on container orchestration (AKS) and Infrastructure as Code (Terraform). You will ensure that our Azure ecosystem is secure, scalable, and monitored to the highest standards.

Key Responsibilities:

  • Kubernetes Orchestration: Manage and optimize Azure Kubernetes Service (AKS), including cluster configuration, scaling, and lifecycle management.
  • Infrastructure as Code (IaC): Develop and maintain automated infrastructure deployments using Terraform, YAML, and Helm charts.
  • Cloud Administration: Oversee core Azure resources including Networking (VNETs, NSGs), Storage Accounts, Azure VMs, and Container Registries (ACR).
  • Security & Identity: Implement and manage Azure Active Directory (Azure AD/Entra ID) and Identity & Access Management (IAM) policies to ensure a "least privilege" environment.
  • Database Management: Administer Azure SQL environments, including SQL Server, individual databases, and Elastic Pools.
  • Observability & Monitoring: Set up and maintain robust monitoring solutions using Azure Monitor, App Insights, and Log Analytics.
  • Disaster Recovery: Design and implement Disaster Recovery (DR) mechanisms and backup strategies (e.g., using Velero).
  • Technical Documentation: Create and maintain comprehensive documentation for system configurations, architecture setups, and operational procedures.

Preferred Skills:

  • Experience with Velero for Kubernetes backups.
  • Knowledge of the ELK Stack (Elasticsearch, Logstash, Kibana).
  • Experience with open-source monitoring tools: Prometheus, Grafana, and Loki.
  • Familiarity with Ansible for configuration management.
  • Exposure to Apache Kafka messaging systems.

Candidate Profile:

The ideal candidate is a proactive engineer who prioritizes automation over manual intervention. You should be comfortable working in a fast-paced environment, taking ownership of cloud resources, and ensuring that all solutions are documented and resilient. Your approach should combine technical depth in Azure with a broader understanding of DevOps best practices.

Technology

Link Group

GitHub Platform Engineer

Senior

Remote

Warsaw, Poland

30,000 - 33,500 PLN

🏢 Summary: Senior GitHub Platform Engineer responsible for architecting, securing, and governing a GitHub Enterprise Cloud environment at scale. The role focuses on identity integration, security compliance, automation, and infrastructure as code to enable secure and efficient developer workflows. You will standardize GitHub Actions, manage runner infrastructure, and drive platform automation using APIs and scripting.

🗂️ Requirements:

  • Extensive experience administering GitHub Enterprise Cloud at scale
  • Strong expertise in Enterprise Managed Users (EMU) and RBAC
  • Hands-on integration with Microsoft Entra ID using SAML, OIDC, and SCIM
  • Proficiency with GitHub Advanced Security including CodeQL and Dependabot
  • Experience designing and securing GitHub Actions workflows
  • Advanced scripting skills in Python, Bash, or PowerShell
  • Experience automating via GitHub REST and GraphQL APIs
  • Experience managing infrastructure using Terraform or Bicep
  • Knowledge of audit logging and SOX compliance controls

📃 Skills: GitHub, EMU, RBAC, Entra, SAML, OIDC, SCIM, GHAS, CodeQL, Dependabot, Actions, Terraform, Bicep, Python, Bash, PowerShell, REST, GraphQL, SOX, Azure, AWS

🏢 Description:

Senior GitHub Platform Engineer

We are seeking a Senior GitHub Platform Engineer to serve as the primary architect and guardian of our GitHub Enterprise Cloud ecosystem. In this role, you will bridge the gap between Cloud Operations and Developer Experience, ensuring our platform is secure, automated, and scalable.

The Role:

As the Platform Owner, you will be responsible for the governance, security, and continuous improvement of GitHub Enterprise. You will lead identity integration, enforce enterprise standards, and empower our engineering teams through automation and AI-driven workflows.

Key Responsibilities:

  • Platform Governance: Administer GitHub Enterprise Cloud settings, organizations, and licensing while establishing operational standards for backup and recovery.
  • Identity & Access: Lead the implementation of Enterprise Managed Users (EMU) integrated with Microsoft Entra ID (SAML/OIDC, SCIM).
  • Security & Compliance: Manage GitHub Advanced Security (GHAS), including CodeQL, Secret Scanning, and Dependabot. Ensure audit log streaming and compliance with regulatory standards (SOX).
  • Automation: Eliminate manual tasks using GitHub REST/GraphQL APIs and scripting (Python, PowerShell, or Bash).
  • Developer Enablement: Standardize GitHub Actions (reusable workflows), manage self-hosted runner infrastructure, and oversee GitHub Copilot adoption.
  • Infrastructure as Code: Maintain and deploy platform configurations using Terraform or Bicep.

Required Qualifications:

  • Expertise: Extensive experience administering GitHub Enterprise Cloud at scale.
  • Identity: Deep understanding of EMU, RBAC, and IdP integration (Microsoft Entra ID).
  • Security: Hands-on proficiency with GHAS toolsets and branch protection strategies.
  • DevOps: Strong background in GitHub Actions, secure workflow design, and runner governance.
  • Scripting: Advanced skills in Python, Bash, or PowerShell with heavy API usage.
  • IaC: Experience managing infrastructure via Terraform or similar tools.

Preferred Skills:

  • Experience in regulated industries (FinTech, Gaming, or Government).
  • Familiarity with Azure or AWS integration patterns.
  • Experience supporting audit readiness and software delivery controls.

Technology

Link Group

DevOps / Site Reliability Engineer

Mid

Hybrid

Kraków, Poland

20,000 - 25,000 PLN

🏢 Summary: DevOps / Site Reliability Engineer role focused on building and maintaining scalable cloud infrastructure while improving platform reliability and automation. The position centers on Kubernetes-based environments, CI/CD pipeline development, and enhancing monitoring and observability. The engineer will support development teams through infrastructure as code and internal developer platform initiatives.

🗂️ Requirements:

  • Experience with cloud platforms (Azure preferred)
  • Strong experience with Kubernetes
  • Strong knowledge of Infrastructure as Code (Terraform)
  • Hands-on experience with CI/CD tools
  • Experience with monitoring and observability tools
  • Understanding of scalability, reliability, and security best practices

📃 Skills: Azure, Kubernetes, Terraform, GitHubActions, ArgoCD, CI/CD, Datadog, Prometheus, Grafana, MongoDB, Rancher, Jenkins, PowerBI, Jira, Confluence

🏢 Description:

DevOps / Site Reliability Engineer

We’re looking for a DevOps / SRE to help build and maintain scalable cloud infrastructure and improve reliability across our platform. You’ll focus on automation, CI/CD, and supporting development teams with efficient tooling and processes.

Key responsibilities:

  • Develop and manage cloud infrastructure (Azure preferred)
  • Work with Kubernetes and containerized environments
  • Build and maintain CI/CD pipelines (GitHub Actions, ArgoCD)
  • Automate deployments and operational processes
  • Contribute to Internal Developer Platform (IDP) development
  • Improve monitoring and observability (e.g., Datadog, Prometheus, Grafana)

Requirements:

  • Experience with cloud platforms and Kubernetes
  • Strong knowledge of Infrastructure as Code (e.g., Terraform)
  • Hands-on experience with CI/CD tools
  • Understanding of scalability, reliability, and security best practices
  • Experience with monitoring/observability tools

Nice to have:

  • Experience with MongoDB Atlas, Rancher, Jenkins, Power BI
  • Familiarity with Jira, Confluence

Technology

Yard Corporate

Site Reliability Engineer (SRE)

Senior

Hybrid

Warsaw, Poland

40,000 - 55,000 PLN

🏢 Summary: Senior Site Reliability Engineer role focused on building and standardizing SRE practices across a hybrid AWS and on-prem infrastructure. The position centers on ensuring scalability, resilience, and high availability of high-frequency, data-intensive platforms through observability, automation, and Kubernetes optimization. You will define SLOs, enhance monitoring architecture, and drive reliability culture across engineering teams.

🗂️ Requirements:

  • 5+ years experience in SRE, DevOps, or Infrastructure Engineering supporting distributed production systems
  • Bachelor’s degree in Computer Science, Computer Engineering, or related field (or equivalent experience)
  • Deep expertise in Grafana, Prometheus, Loki, and Tempo (OpenTelemetry)
  • Strong production experience with Docker and Kubernetes
  • Experience managing hybrid infrastructure (AWS and on-premises)
  • Proficiency in at least one language: Python, Go, or Bash
  • Hands-on experience with CI/CD pipelines and Infrastructure-as-Code
  • Experience defining and managing SLOs and SLAs
  • Willingness to participate in on-call rotation

📃 Skills: AWS, Kubernetes, Docker, Prometheus, Grafana, Loki, Tempo, OpenTelemetry, Python, Go, Bash, CI/CD, IaC, Git, Hypervisors

🏢 Description:

About the Client:

Our client is a premier, global investment management firm operating at the intersection of finance and technology. Known for their sophisticated, data-intensive systems, they build and maintain high-performance platforms that process massive volumes of market and operational data. To support their expanding footprint, they are looking for a senior-level Site Reliability Engineer (SRE) who will take ownership of shaping, standardizing, and scaling their SRE frameworks and reliability culture from the ground up.

The Role:

In this role, you will serve as a foundational force for SRE practices, partnering directly with Cloud, Infrastructure, and Software Engineering squads. You will work across a hybrid infrastructure (combining advanced AWS cloud environments and physical on-premises servers) to guarantee the scalability, resilience, and maximum uptime of critical, high-frequency transactional platforms.

Core Responsibilities:

  • SRE Evangelism: Design, implement, and champion core reliability principles, helping technology teams adopt sustainable scaling practices.
  • Observability Architecture: Implement, scale, and maintain end-to-end monitoring, telemetry, and distributed tracing systems utilizing Prometheus, Grafana, Loki, and Tempo (OpenTelemetry framework).
  • Kubernetes Optimization: Establish best-practice configurations for containerized workloads, ensuring applications running on Kubernetes are highly resilient, cost-effective, and performant.
  • Incident Management & Culture: Participate in a balanced, shared on-call rotation (averaging one week per month).
  • Automation & Engineering: Build custom tooling and CI/CD pipelines to automate routine tasks, system health checks, and rapid disaster recovery workflows.
  • SLO/SLA Definition: Partner with product and engineering teams to define, monitor, and enforce Service Level Objectives (SLOs) and Error Budgets.

What We Look For:

  • Experience: 5+ years of hands-on experience in a dedicated SRE, DevOps, or Infrastructure Engineering role supporting complex, distributed production systems.
  • Education: A Bachelor’s degree in Computer Science, Computer Engineering, or a related technical discipline (or equivalent practical experience).
  • Observability Expertise: Deep, subject-matter knowledge of modern monitoring stacks, specifically Grafana, Prometheus, Loki, and Tempo (OTel).
  • Orchestration & Containers: Strong, production-grade expertise in containerization (Docker) and orchestration (Kubernetes).
  • Hybrid Infrastructure: Experience navigating hybrid models, managing both cloud services (AWS preferred) and physical on-premise hardware resources.
  • Scripting/Coding: Proficiency in writing clean, maintainable code in at least one scripting or programming language (e.g., Python, Bash, or Go) to build reliable automation.
  • Methodologies: Solid grounding in CI/CD concepts, infrastructure-as-code (IaC), and agile development processes.
  • Soft Skills: Excellent verbal and written communication skills, with a proven ability to convey complex infrastructure and reliability concepts to both technical and non-technical stakeholders.

What We Offer:

  • Stable Employment: Full-time employment contract (Umowa o Pracę - UoP).
  • Tax Optimization: Eligibility for creative tax-deductible costs (KUP - Koszty Uzyskania Przychodu).
  • Financial Reward: Highly competitive base salary accompanied by a generous annual performance bonus.
  • Comprehensive Health: Premium private medical care package that fully includes dental coverage (stomatologia).
  • Wellness & Lifestyle: MultiSport card to keep you active and healthy.
  • Daily Perks: Pre-funded lunch card for your daily meals.

Tech Stack at a Glance:

  • Cloud & Virtualization: AWS, Kubernetes, Docker, On-Premises Hypervisors
  • Observability: Prometheus, Grafana, Loki, Tempo, OpenTelemetry (OTel)
  • Languages: Python, Go, Bash
  • CI/CD & Automation: Git-based pipelines, Configuration Management, IaC