April 24, 2026

Senior Site Reliability Engineer

Senior • Hybrid • On-site

Dallas, TX

About Lantern

Lantern is the specialty care platform connecting people with the best care when they need it most. By curating a Network of Excellence comprised of the nation's top specialists for surgery, cancer care, infusions and more, Lantern delivers excellent care with significant cost savings to employers and their workforces. Lantern also pairs members with a dedicated care team, including Care Advocates and nurses, for the entirety of their care journey, helping them get back to good health, back to their families and back to work. With convenient access to specialists nationwide, Lantern means quality care is within driving distance for most. Lantern is trusted by the nation's largest employers to deliver care to more than 6 million members across the country. Learn more about us at lanterncare.com.

About You:

You use LOGIC in your decision making and understand that progress is critical to making change. You focus on the execution of your content while balancing a fast-paced environment and you take the time to celebrate both the small & big wins.
INCLUSION is a core tenet of your personal beliefs. A diverse and inclusive environment is incredibly important to you. You understand and desire to be a part of a diverse team with different experiences and perspectives & you cherish the differences in each individual that you interact with.
You have the GRIT, drive and ambition to tackle big problems. Big problems require big ideas and a team that supports new ideas.
You care deeply for your customers and are driven to keep HUMANITY in all decisions. Your customers aren’t just the individuals using your product. They are the driving factor in your motivation to make a change.
Integrity guides you in life. You focus on the TRUTH rather than giving people the answers they want to hear.
You thrive in a Team Environment. Collaboration is key in innovation and creating change.

These pillars of LIGHT are a reminder to our team that we are making a difference by providing guidance and support in navigating the often complex and confusing landscape of healthcare. We hope that through this LIGHT, individuals can find their way to the best care, resources, and support they need to get back to life.

If this sounds like you, we would love to connect to speak further about career opportunities at Lantern.

Please apply to our role & someone from our Talent Acquisition Team will reach out to help you navigate our interview process.

Lantern is seeking an experienced Senior Site Reliability Engineer to champion the reliability, availability, and performance of our Azure-based healthcare platform. In this pivotal role, you will define and implement SRE practices, drive incident management processes, build observability frameworks, and ensure our systems meet stringent uptime and compliance requirements. You will collaborate with platform engineers, application developers, and security teams to embed reliability into every layer of our infrastructure. This role is ideal for an SRE expert with deep experience in production operations, monitoring, incident response, and automation in cloud environments.

You will work on the Platform Engineering team, partnering with application developers, infrastructure engineers, and security teams to establish SRE best practices across Lantern. Your focus will be on building resilience, reducing toil through automation, and creating a culture of reliability that ensures our healthcare platform delivers consistent, high-quality service to our users.

Location: Hybrid - at least 3 days/wk in one of our offices: Dallas, TX / Chicago/Evanston / New York / Washington, DC
On-Call: This position requires being on-call 1 week per month

Responsibilities:

Define and track SLOs/SLIs/error budgets for critical healthcare services
Build and maintain observability platforms (monitoring, logging, alerting, tracing) using Datadog and Azure Monitor
Lead incident management processes using Rootly, including on-call rotations, runbooks, and post-incident reviews
Automate operational toil through Infrastructure-as-Code (Terraform) and custom tooling
Design and implement disaster recovery and business continuity strategies
Collaborate with development teams to improve service reliability through architecture reviews and chaos engineering
Optimize system performance, capacity planning, and cost efficiency for Azure infrastructure
Ensure production systems meet HIPAA, SOC 2, and other regulatory requirements
Maintain and improve CI/CD pipelines to support safe, rapid deployments
Mentor junior engineers and foster a culture of reliability and operational excellence

Requirements:

Bachelor’s degree in Computer Science, Information Technology, Engineering, or a related field, or equivalent practical experience.
4+ years in SRE, DevOps, or production operations roles
3+ years with Microsoft Azure (AWS/GCP a plus)
Strong experience with observability tools (Datadog, Azure Monitor, Prometheus, Grafana, or similar)
Experience defining and managing SLOs/SLIs and error budgets
Proven incident management and on-call experience (Rootly or similar incident management platforms)
Hands-on with Infrastructure as Code (Terraform) and CI/CD (Azure DevOps, GitHub Actions)
Experience in regulated environments (healthcare/HIPAA preferred)
Strong scripting skills (Python, Bash, PowerShell)
Excellent communication and collaboration skills
If you don’t meet every requirement listed, we still encourage you to apply.

Strong Candidates Will Have:

Deep experience with chaos engineering and reliability testing
Experience with Azure Kubernetes Service and containerized workloads
Relevant certifications (Azure, SRE, Kubernetes)

Benefits

Medical Insurance
Dental Insurance
Vision Insurance
Short & Long Term Disability
Life Insurance
401k with company match
Flexible Time Off
Paid Parental Leave

Lantern does not discriminate on the basis of race, sex, color, religion, age, national origin, marital status, disability, veteran status, genetic information, sexual orientation, gender identity or any other reason prohibited by law in provision of employment opportunities and benefits.

Similar jobs you might like

Technology

Webellian Sp.z o o

Senior Site Reliability Engineer

Senior

Hybrid

Warsaw, Poland

🏢 Summary: Site Reliability Engineer role focused on ensuring reliability, scalability, and operational excellence of a cloud-based analytics platform running AI services, Java APIs, and frontend applications. The position centers on Kubernetes infrastructure (AKS), Azure cloud services, Infrastructure as Code, observability, incident management, and automation. You will drive SLO/SLI practices, improve platform resilience, and eliminate operational toil through automation and CI/CD integration. 🗂️ Requirements: 5+ years experience in SRE, DevOps, or Platform Engineering, Strong Kubernetes cluster operations and troubleshooting experience, Hands-on experience with Terraform for Infrastructure as Code, Production experience with Azure services (AKS, ACR, Key Vault, Azure Monitor, Application Insights, VNet), Experience with Prometheus and Grafana for observability and alerting, Knowledge of SLO/SLI methodology and error budget management, Experience in incident management and on-call support, Scripting skills in Python or Bash, Experience with CI/CD and GitOps tools (GitHub Actions, ArgoCD) 📃 Skills: Kubernetes, AKS, Azure, Terraform, Prometheus, Grafana, Python, Bash, GitHub, ArgoCD, ACR, KeyVault, AzureMonitor, ApplicationInsights, VNet, Ingress, Docker, GitOps, CI/CD 🏢 Description: About Webellian Webellian is a well-established Digital Transformation and IT consulting company committed to creating a positive impact for our clients. We strive to make a meaningful difference in diverse sectors such as insurance, banking, healthcare, retail, and manufacturing. Our passion for cutting-edge and disruptive technologies, as well as our shared values and strong principles, are what motivate us. We are a community of engineers and senior advisors who work with our clients across industries, playing a deep and meaningful role in accelerating and realizing their vision and strategy. 
About the position As a Site Reliability Engineer within Advanced Analytics Team you will join the Infra team to own the reliability and operational health of the platform. You will define and maintain service level objectives, drive incident response at the infrastructure layer, and systematically eliminate operational toil through automation. You will work closely with Platform Engineers, Security Engineers, and the Run & Change team to ensure the platform meets its reliability commitments across production workloads spanning AI services, Java APIs, and frontend applications. Key responsibilities: Define, instrument, and maintain SLOs and SLIs for platform components; own error budget tracking and produce regular reliability reports for hub leadership. Serve on the on-call rotation as the infrastructure escalation tier; lead incident response for cluster-level, network-level, and storage failures; chair blameless post-incident reviews. Implement and operate Kubernetes infrastructure (AKS): cluster lifecycle management, networking, resource quotas, autoscaling configuration, and multi-tenancy patterns across spoke namespaces. Develop Infrastructure as Code (Terraform) to provision and manage Azure resources with consistency, auditability, and repeatable rollback capability. Build and maintain observability infrastructure: Prometheus, Grafana, Azure Monitor, and Application Insights; own alerting rules, dashboards, and distributed tracing coverage across platform components. Perform capacity planning and cost-aware resource management: right-size node pools, tune vertical and horizontal pod autoscalers, and identify resource waste across namespaces. Identify and eliminate toil: automate repetitive operational tasks through scripting and tooling; measure and track toil reduction over time. Maintain platform reliability procedures: rolling upgrades, backup and recovery testing, disaster recovery runbooks, and change freeze coordination. 
Contribute to CI/CD pipelines and GitOps tooling (GitHub Actions, ArgoCD) from a reliability and deployment safety perspective; work with the Platform Team on release gates and rollback mechanisms. Collaborate with the Run & Change team on incident SLA targets and operational procedures; work with Security Engineers on infrastructure hardening and vulnerability remediation. Required Experience & Skills 5+ years professional experience in site reliability engineering, DevOps, or platform engineering roles. Strong Kubernetes experience: cluster operations, networking (Ingress, network policies), storage, autoscaling, and hands-on troubleshooting across production environments. Solid Infrastructure as Code experience with Terraform; familiarity with Bicep or ARM templates is a plus. Production experience with Azure cloud services: AKS, ACR, Key Vault, Azure Monitor, Application Insights, Virtual Networks, and Private Endpoints. Strong observability experience: Prometheus, Grafana, centralized logging, alerting configuration, and distributed tracing instrumentation. Working knowledge of SLO/SLI methodology: error budget principles, reliability target setting, and capacity planning. Structured incident management experience: on-call ownership, blameless post-incident review, and runbook authorship. Scripting and automation proficiency in Python or bash for toil elimination and operational tooling. Strong CI/CD experience: GitHub Actions and ArgoCD or equivalent GitOps tooling. Ways of Working Comfortable in agile, iterative delivery environments with personal ownership and accountability for platform reliability. Clear communicator across global, cross-functional stakeholders; able to translate technical reliability metrics into business impact for non-technical audiences. Proactive learner with pragmatic adoption of AI-assisted developer tools (e.g., GitHub Copilot, Claude Code) to improve automation coverage and delivery velocity. 
Nice to Have Kubernetes certifications: CKA or CKAD. Experience supporting AI or ML infrastructure workloads: GPU scheduling, model serving platforms, or inference pipeline operations. Exposure to chaos engineering practices and fault injection testing. FinOps experience: reserved capacity planning, resource right-sizing programs, and cost attribution per team or workload. Service mesh experience (Istio, Linkerd) for traffic management and reliability patterns. Experience in regulated industries (insurance, finance, healthcare) where auditability, change traceability, and secure-by-default operations are standard practice. What we offer Contract under Polish law: B2B or Umowa o Pracę Benefits such as private medical care, group insurance, Multisport card English classes available Hybrid work (at least 1 day/week on-site) in Warsaw (Mokotów) Opportunity to work with excellent professionals High standards of work and focus on the quality of code New technologies in use Continuously learning and growth International team Pinball, PlayStation & much more (on-site) Join a growing team of dedicated professionals! We love to pass on the knowledge to grow excellence, speak our minds without playing politics, and just enjoy hanging around together. If you share our passions - we want to meet you! So go ahead and apply ➡️

Technology

Yard Corporate

Site Reliability Engineer (SRE)

Senior

Hybrid

Warsaw, Poland

40,000 - 55,000 PLN

🏢 Summary: Senior Site Reliability Engineer role focused on building and standardizing SRE practices across a hybrid AWS and on-prem infrastructure. The position centers on ensuring scalability, resilience, and high availability of high-frequency, data-intensive platforms through observability, automation, and Kubernetes optimization. You will define SLOs, enhance monitoring architecture, and drive reliability culture across engineering teams. 🗂️ Requirements: 5+ years experience in SRE, DevOps, or Infrastructure Engineering supporting distributed production systems, Bachelor’s degree in Computer Science, Computer Engineering, or related field (or equivalent experience), Deep expertise in Grafana, Prometheus, Loki, and Tempo (OpenTelemetry), Strong production experience with Docker and Kubernetes, Experience managing hybrid infrastructure (AWS and on-premises), Proficiency in at least one language: Python, Go, or Bash, Hands-on experience with CI/CD pipelines and Infrastructure-as-Code, Experience defining and managing SLOs and SLAs, Willingness to participate in on-call rotation 📃 Skills: AWS, Kubernetes, Docker, Prometheus, Grafana, Loki, Tempo, OpenTelemetry, Python, Go, Bash, CI/CD, IaC, Git, Hypervisors 🏢 Description: About the Client Our client is a premier, global investment management firm operating at the intersection of finance and technology. Known for their sophisticated, data-intensive systems, they build and maintain high-performance platforms that process massive volumes of market and operational data. To support their expanding footprint, they are looking for a senior-level Site Reliability Engineer (SRE) who will take ownership of shaping, standardizing, and scaling their SRE frameworks and reliability culture from the ground up. The Role In this role, you will serve as a foundational force for SRE practices, partnering directly with Cloud, Infrastructure, and Software Engineering squads. 
You will work across a hybrid infrastructure (combining advanced AWS cloud environments and physical on-premises servers) to guarantee the scalability, resilience, and maximum uptime of critical, high-frequency transactional platforms. Core Responsibilities SRE Evangelism: Design, implement, and champion core reliability principles, helping technology teams adopt sustainable scaling practices. Observability Architecture: Implement, scale, and maintain end-to-end monitoring, telemetry, and distributed tracing systems utilizing Prometheus, Grafana, Loki, and Tempo (OpenTelemetry framework). Kubernetes Optimization: Establish best-practice configurations for containerized workloads, ensuring applications running on Kubernetes are highly resilient, cost-effective, and performant. Incident Management & Culture: Participate in a balanced, shared on-call rotation (averaging one week per month). Automation & Engineering: Build custom tooling and CI/CD pipelines to automate routine tasks, system health checks, and rapid disaster recovery workflows. SLO/SLA Definition: Partner with product and engineering teams to define, monitor, and enforce Service Level Objectives (SLOs) and Error Budgets. What We Look For Experience: 5+ years of hands-on experience in a dedicated SRE, DevOps, or Infrastructure Engineering role supporting complex, distributed production systems. Education: A Bachelor’s degree in Computer Science, Computer Engineering, or a related technical discipline (or equivalent practical experience). Observability Expertise: Deep, subject-matter knowledge of modern monitoring stacks, specifically Grafana, Prometheus, Loki, and Tempo (OTel). Orchestration & Containers: Strong, production-grade expertise in containerization (Docker) and orchestration (Kubernetes). Hybrid Infrastructure: Experience navigating hybrid models—managing both cloud services (AWS preferred) and physical on-premise hardware resources. 
Scripting/Coding: Proficiency in writing clean, maintainable code in at least one scripting or programming language (e.g., Python, Bash, or Go) to build reliable automation. Methodologies: Solid grounding in CI/CD concepts, infrastructure-as-code (IaC), and agile development processes. Soft Skills: Excellent verbal and written communication skills, with a proven ability to convey complex infrastructure and reliability concepts to both technical and non-technical stakeholders. What We Offer Stable Employment: Full-time employment contract ( Umowa o Pracę - UoP ). Tax Optimization: Eligibility for creative tax-deductible costs ( KUP - Koszty Uzyskania Przychodu). Financial Reward: Highly competitive base salary accompanied by a generous annual performance bonus . Comprehensive Health: Premium private medical care package that fully includes dental coverage (stomatologia) . Wellness & Lifestyle: MultiSport card to keep you active and healthy. Daily Perks: Pre-funded lunch card for your daily meals. Tech Stack at a Glance Cloud & Virtualization: AWS, Kubernetes, Docker, On-Premises Hypervisors Observability: Prometheus, Grafana, Loki, Tempo, OpenTelemetry (OTel) Languages: Python, Go, Bash CI/CD & Automation: Git-based pipelines, Configuration Management, IaC

Technology

Cyclad

Senior Site Reliability Engineer (Azure & OpenShift)

Senior

Hybrid

Warsaw, Poland

130 - 140 PLN

🏢 Summary: Senior Site Reliability Engineer role focused on operating and optimizing Azure and OpenShift environments, ensuring reliability, scalability, and automation across staging and production systems. The position involves managing CI/CD, implementing Infrastructure as Code, enhancing observability, and leading incident management in a hybrid cloud setup. This is a hybrid role in Warsaw with collaboration across engineering teams to embed SRE and DevOps best practices. 🗂️ Requirements: 3+ years in SRE, DevOps, or Platform Engineering, Strong production experience with Microsoft Azure, Strong experience with OpenShift and Kubernetes, Experience with CI/CD pipelines and GitOps, Hands-on experience with Terraform, Experience with monitoring and logging tools, Scripting skills (Bash or similar), Knowledge of observability and reliability engineering principles, Experience in incident management and root cause analysis, Focus on automation and system stability 📃 Skills: Azure, OpenShift, Kubernetes, Terraform, ArgoCD, Jenkins, GitHubActions, Bash, Prometheus, Grafana, ELK, EFK, AzureMonitor, GitOps, CI/CD 🏢 Description: In Cyclad we work with top international IT companies in order to boost their potential in delivering outstanding, cutting edge technologies that shape the world of the future. Currently, we are looking for experienced Senior Site Reliability Engineer (Azure & OpenShift) to join our team. 
Project information: Location: Warsaw (hybrid) Type of employment: B2B contract or standard employment contract Project languages: English Key Responsibilities: Own and operate staging and production environments in Microsoft Azure Manage and support application deployments on OpenShift (on-prem and Azure) Support and optimize CI/CD pipelines and enable GitOps practices (e.g., ArgoCD) Ensure system reliability through SLIs, SLOs, and continuous improvement of service health Design, implement, and maintain observability solutions (monitoring, logging, alerting) using tools such as Prometheus, Grafana, Azure Monitor, and ELK/EFK Troubleshoot issues across infrastructure, platform (Azure/OpenShift), applications, and deployments Lead incident management, including root cause analysis (RCA), MTTR reduction, and prevention of recurring issues Build and maintain Infrastructure as Code using Terraform and drive automation to reduce operational toil Improve deployment reliability, release processes, and overall system resilience Collaborate with development teams to embed reliability into design, delivery, and operational practices Maintain and improve operational documentation, including runbooks and procedures Ensure performance, scalability, cost efficiency, security, and compliance of cloud infrastructure Advocate for SRE best practices and a DevOps culture across engineering teams Requirements: 3+ years of experience in SRE, DevOps, or Platform Engineering roles Strong experience with Microsoft Azure in production environments Strong experience with OpenShift Container Platform (OCP) and Kubernetes Experience with CI/CD pipelines (e.g., ArgoCD, Jenkins, GitHub Actions) and container-based deployments Strong understanding of observability, incident management, and reliability engineering principles Hands-on experience with Infrastructure as Code (Terraform) Scripting experience (Bash or similar) Experience with monitoring and logging tools (Prometheus, Grafana, 
ELK/EFK) Strong focus on automation, system stability, and continuous improvement We offer: Private medical care with dental care (covering 70% of costs). Family package option possible Multisport card (also for an accompanying person) Life insurance Work with talented engineers on large-scale, technically challenging projects

Technology

emagine Polska

Senior DevOps / SRE (Platform Reliability Engineer) - French fluent

Senior

Remote

Lisbon, Portugal

🏢 Summary: Senior DevOps / SRE role focused on ensuring reliability, scalability, security, and performance of a cloud-native AWS platform. The position centers on infrastructure automation, CI/CD, Kubernetes operations, observability, and implementing SRE best practices to support highly available production systems. You will lead incident management, optimize cloud costs, and drive continuous improvement of platform resilience. 🗂️ Requirements: 5+ years in DevOps/SRE/Cloud/Platform Engineering, Strong Linux administration and troubleshooting, Production experience with Kubernetes, Experience with CI/CD tools, Expertise in Infrastructure as Code, Hands-on experience with AWS, Strong networking fundamentals, Experience with monitoring and logging tools, Scripting skills (Bash or Python) 📃 Skills: AWS, Kubernetes, Docker, Helm, Terraform, Ansible, CloudFormation, Linux, GitLab, Jenkins, GitHub, Azure, Prometheus, Grafana, ELK, Datadog, Splunk, Bash, Python, TCP/IP, DNS 🏢 Description: We are looking for a Senior DevOps / Site Reliability Engineer (SRE) to ensure the reliability, scalability, performance, and security of our platform and cloud infrastructure. You will play a key role in building and operating cloud-native systems, improving observability, automating operations, implementing SRE best practices (SLOs/SLIs), and supporting development teams to deliver highly available services. Key Responsibilities Design, implement, and maintain highly available and scalable infrastructure on AWS. Own and improve the reliability of production systems using SRE principles (SLO, SLI, error budgets). Build and manage CI/CD pipelines to support fast and safe software delivery. Develop and maintain Infrastructure as Code (IaC) using Terraform, Ansible, CloudFormation, etc. Manage and optimize container orchestration platforms (Kubernetes, Docker, Helm). Implement and maintain monitoring, logging, and alerting solutions (Prometheus, Grafana, ELK, Datadog, Splunk). 
Lead incident response, perform root cause analysis, and write postmortems to drive continuous improvement. Improve system performance, capacity planning, scaling strategies, and disaster recovery processes. Collaborate closely with development teams to improve deployment strategies and system resilience. Implement security best practices (IAM, secret management, vulnerability scanning, patching). Define operational standards, runbooks, documentation, and best practices for platform reliability. Participate in on-call rotation and provide senior-level support for critical production issues. Key Responsibilities (5 Main Missions) The DevOps / SRE lead will be responsible for the stability and evolution of the platform. Your role is structured around five main areas: Mission 1: AWS Infrastructure Management (Build & Run) Mission 2: CI/CD and Deployment Automation Mission 3: Monitoring, Observability, and Alerting: Global Monitoring , Log Management , Application Monitoring , Business Analytics Mission 4: Incident Management, Resilience, and Security Mission 5: FinOps and AWS Cost Optimization Key Requirements 5+ years of experience in DevOps / SRE / Cloud Infrastructure / Platform Engineering. Strong expertise in Linux systems administration and troubleshooting. Proven experience with Kubernetes in production environments. Strong experience with CI/CD tools (GitLab CI, Jenkins, GitHub Actions, Azure DevOps). Solid knowledge of Infrastructure as Code (Terraform highly preferred). Experience with AWS cloud platforms. Strong understanding of networking fundamentals (TCP/IP, DNS, load balancing, reverse proxies). Experience with observability tools: monitoring, metrics, logging, tracing. Strong scripting skills (Bash, Python, or similar). French advanced level. Nice to Have Experience with additional cloud platforms (Azure, GCP). Strong understanding of networking fundamentals.

Technology

Link Group

DevOps / Site Reliability Engineer

Mid

Hybrid

Kraków, Poland

20,000 - 25,000 PLN

🏢 Summary: DevOps / Site Reliability Engineer role focused on building and maintaining scalable cloud infrastructure while improving platform reliability and automation. The position centers on Kubernetes-based environments, CI/CD pipeline development, and enhancing monitoring and observability. The engineer will support development teams through infrastructure as code and internal developer platform initiatives. 🗂️ Requirements: Experience with cloud platforms (Azure preferred), Strong experience with Kubernetes, Strong knowledge of Infrastructure as Code (Terraform), Hands-on experience with CI/CD tools, Experience with monitoring and observability tools, Understanding of scalability, reliability, and security best practices 📃 Skills: Azure, Kubernetes, Terraform, GitHubActions, ArgoCD, CI/CD, Datadog, Prometheus, Grafana, MongoDB, Rancher, Jenkins, PowerBI, Jira, Confluence 🏢 Description: DevOps / Site Reliability Engineer We’re looking for a DevOps / SRE to help build and maintain scalable cloud infrastructure and improve reliability across our platform. You’ll focus on automation, CI/CD, and supporting development teams with efficient tooling and processes. Key responsibilities Develop and manage cloud infrastructure (Azure preferred) Work with Kubernetes and containerized environments Build and maintain CI/CD pipelines (GitHub Actions, ArgoCD) Automate deployments and operational processes Contribute to Internal Developer Platform (IDP) development Improve monitoring and observability (e.g., Datadog, Prometheus, Grafana) Requirements Experience with cloud platforms and Kubernetes Strong knowledge of Infrastructure as Code (e.g., Terraform) Hands-on experience with CI/CD tools Understanding of scalability, reliability, and security best practices Experience with monitoring/observability tools Nice to have Experience with MongoDB Atlas, Rancher, Jenkins, Power BI Familiarity with Jira, Confluence

Technology

EPAM Systems

Senior Site Reliability Engineer (SRE)

Senior

Remote

🏢 Summary: The offer is for a Site Reliability Engineer responsible for ensuring high reliability, scalability, and performance of cloud-based systems. The role focuses on implementing SRE practices, automating infrastructure, managing incidents, and enhancing monitoring and CI/CD processes. You will collaborate with cross-functional teams to optimize operations and maintain service excellence. 🗂️ Requirements: Bachelor’s degree in Computer Science, Engineering, or related field, 3+ years of experience in Site Reliability Engineering or similar role, Experience with cloud platforms (AWS, GCP, or Azure), Hands-on experience with SRE practices (SLO, SLI, error budgets, postmortems, toil reduction, capacity planning, incident management), Proficiency in Python or other scripting/programming language, Experience with monitoring tools, Experience with CI/CD tools, Experience with infrastructure as code, Experience with configuration management, Knowledge of Kubernetes and Docker, English proficiency B2 or higher 📃 Skills: AWS, GCP, Azure, Python, Kubernetes, Docker, CI/CD, Terraform, Ansible, Monitoring, SLO, SLI, Git, Bash 🏢 Description: We are seeking a highly skilled and motivated Site Reliability Engineer (SRE) to join our team. In this critical role, you will collaborate closely with software developers and operations teams to ensure high reliability, scalability, and efficiency of our systems, with a strong focus on meeting and exceeding customer expectations. Your expertise will be crucial in deploying, maintaining, and automating our infrastructure and application environments to ensure seamless user experiences. Your proactive involvement will be key to enhancing system reliability, optimizing resource utilization, and ensuring continuous improvement in our operational practices. Your responsibilities will include defining and tracking Service Level Objectives (SLOs), managing error budgets, and reducing toil through automation. 
You will play a pivotal role in driving the success of technology initiatives, maximizing their impact across the organization, and ensuring that solutions consistently meet the high standards our customers expect.

Responsibilities

- Collaborate with development, security, quality, and operations teams to implement SRE practices and ensure system reliability
- Define and support the required level of reliability, availability, and performance for services and applications
- Design and deliver cloud-based solutions tailored to client needs
- Troubleshoot, mitigate, and support the fixing of infrastructure and application issues in a timely manner
- Implement a monitoring system for infrastructure and application reliability
- Communicate technical concepts clearly to both engineering teams and management stakeholders

Requirements

- Bachelor’s degree in Computer Science, Engineering, or a related field
- 3+ years of hands-on experience in Site Reliability Engineering or related roles
- Proven experience in any cloud (AWS/GCP/Azure)
- Experience implementing SRE practices such as SLO/SLI, error budgets, postmortems, toil reduction, capacity planning, and incident management
- Python or another scripting/programming language
- Strong background in monitoring tools
- Proficiency in CI/CD tools, infrastructure as code, and configuration management
- Solid knowledge of container orchestration technologies (Kubernetes, Docker)
- English language proficiency at an Upper-Intermediate level (B2) or higher

Nice to have

- Expertise in deployment and management of LLMs, including technologies like RAG
- Certification in Kubernetes, AWS/GCP/Azure, or similar technologies
- Proven experience in DevOps
- Knowledge of managing and optimizing AI/ML models in production environments, including basic deployment, monitoring, and maintenance

We offer/Benefits

We gather like-minded people:
- Engineering community of industry professionals
- Friendly team and enjoyable working environment
- Flexible schedule and opportunity to work remotely within Poland
- Chance to work abroad for up to 60 days annually
- Business-driven relocation opportunities

We provide growth opportunities:
- Outstanding career roadmap
- Leadership development, career advising, soft skills, and well-being programs
- Certification (GCP, Azure, AWS)
- Unlimited access to LinkedIn Learning, Get Abstract, Cloud Guru
- English classes

We cover it all:
- Stable income (Employment Contract or B2B)
- Participation in the Employee Stock Purchase Plan
- Benefits package (health insurance, multisport, shopping vouchers)
- Strategically located offices featuring entertainment and relaxation zones, table tennis and football, free snacks, fantastic coffee, and more
- Referral bonuses
- Corporate, social and well-being events

Please note: The set of bonuses might vary based on the role you apply for; specifics will be discussed with our recruiter during the general interview. We will reach out to selected candidates exclusively.

EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.

Technology

Medallion

Technical Solutions Manager

Senior

Remote

110,004 - 140,004 USD/yr

🏢 Summary: Technical Solutions Manager responsible for onboarding enterprise customers by leading complex data migrations and building integrations into the platform. The role focuses on designing and executing end-to-end ETL processes, developing SQL and Python-based automation, and collaborating cross-functionally to ensure successful implementations. You will work hands-on with large, complex datasets and create scalable integration solutions to support workflow automation.

🗂️ Requirements:
- 6+ years of experience in data modeling and/or data migration with complex datasets
- 4+ years of experience working directly with enterprise clients
- Proven ability to write complex SQL queries
- Proven ability to write Python scripts from scratch
- Experience building bi-directional integrations between external and internal systems
- Experience designing and optimizing end-to-end ETL processes
- Familiarity with relational databases (PostgreSQL, MySQL, SQL Server, or Oracle)
- Experience managing multiple complex data projects simultaneously

📃 Skills: SQL, Python, ETL, PostgreSQL, MySQL, SQLServer, Oracle, DataModeling, DataMigration, DataPipelines, Integrations, SaaS

🏢 Description:

About the role:

As a Technical Solutions Manager your primary responsibility will be getting our Enterprise customers successfully onboarded onto the Medallion platform. You will directly build integrations, migrate large data into Medallion and develop documentation and tools to make these processes more efficient as the business grows. As part of your role, you will engage hands-on with customers during a critical phase in their customer journey and collaborate with colleagues in Customer Success, Operations, EPD, and Sales. You will work with complex and varied client data models, quickly assess data quality, and build tailored solutions to meet each client's migration needs. You will also build evergreen integrations between internal and external systems to support workflow automation.

This role reports to the Head of Technical Solutions. Base compensation may land between $110,000-$140,000. In addition to base salary, equity and benefits are offered as part of the total compensation package.

Responsibilities:
- Support largest customers to strategize how they will migrate their existing data to the platform
- Own the end-to-end data migration process, including data extraction from legacy systems, cleansing, transformation, and loading into target platforms
- Engage directly with clients or stakeholders to gather data requirements, resolve data discrepancies, and communicate migration progress or risks
- Become the SME on customers' data while tracking milestones throughout the implementation
- Build evergreen integrations from scratch to support the automation of workflows
- Write SQL queries for data extraction, transformation, and reporting, with a focus on clean, maintainable code
- Work with complex and varied client data models, quickly assess data quality, and build tailored solutions to meet each client's migration needs
- Set customer expectations and timelines, perform data validation, and migrate complex datasets into the platform
- Identify blockers that could delay an implementation and partner with internal and external teams to resolve them
- Build Python-based automation scripts for data ingestion, transformation, reconciliation, and migration validation
- Partner with the EPD team to productize repetitive aspects of implementations that can be automated

Requirements:
- 6+ years of experience in data modeling and/or migration, ideally with complex data sets
- 4+ years of experience working directly with enterprise clients
- Proven experience writing complex SQL queries and Python scripts, not just modifying existing code
- Ability to build bi-directional integrations between external and internal platforms
- Ability to design, build, and optimize queries and data pipelines end-to-end, including ETL processes
- Familiarity with relational databases (PostgreSQL, MySQL, SQL Server, or Oracle) in a migration or data pipeline context
- Experience coordinating multiple complex data projects simultaneously

Technology

Grid Dynamics Poland

Senior Site Reliability Engineer (SRE)

Senior

Hybrid

Warsaw, Poland

100 - 128 PLN

🏢 Summary: Senior Site Reliability Engineer role focused on ensuring reliability, performance, and resilience of enterprise products by bridging infrastructure and software engineering. The position involves hands-on Java/Spring Boot code fixes, Kubernetes-based container operations, incident response, and proactive architecture improvements. The engineer drives automation, observability, and security best practices across the SDLC.

🗂️ Requirements:
- 5+ years experience in SRE or Platform Engineering
- Strong proficiency in Java
- Strong proficiency in Spring Boot
- Experience with Hibernate
- Experience with Jenkins
- Ability to read, analyze and fix application code
- Hands-on experience with Docker
- Hands-on experience with Kubernetes
- Deep knowledge of Linux systems
- Strong understanding of networking
- Experience with distributed systems
- Experience with monitoring and observability tools
- Bachelor’s degree in Computer Science, Systems Engineering or equivalent experience

📃 Skills: Java, Spring, Hibernate, Jenkins, Docker, Kubernetes, Linux, Networking, Prometheus, Grafana, Splunk

🏢 Description:

We are looking for an experienced Senior Site Reliability Engineer to join our team and oversee the reliability, resilience, and performance of our core enterprise products. In this role, you will bridge the gap between infrastructure operations and software engineering. You won't just react to alerts - you will proactively analyze system architecture, build automation, and dive deep into the application code (Java/Spring Boot) to fix bugs and eliminate issues at their root.

Responsibilities:
- Architecture & Reliability: Understand the end-to-end product topology from both infrastructure and application perspectives. Identify bottlenecks, scale limitations, and unstable components, driving long-term resolutions before they impact production.
- Incident Response & RCA: Respond to outages, provide L3 on-call technical support (on rotation), and perform blameless Root Cause Analysis (RCA) to implement permanent fixes.
- Hands-on Engineering: Address defects, perform code bug fixes directly in production, and recommend architectural improvements during incident analysis.
- Security & Vulnerability Management: Oversee vulnerability management for applications and containers, manage patching processes, ensure compliance, and monitor certificate expirations and renewals according to global best practices.
- SRE Advocacy & SDLC: Represent the SRE organization in design reviews, capacity planning, and operational readiness exercises. Partner closely with development teams to embed reliability best practices early in the SDLC.
- Automation & Mentoring: Build automation tools to reduce manual toil and improve efficiency. Spread SRE culture, create standard documentation, and provide technical mentorship to junior team members.
- System Health: Oversee the production environment by tracking availability, applying learnings from observability tools, and becoming a Subject Matter Expert (SME) on core issuing products.

Min requirements:
- Experience: 5+ years of experience in Site Reliability Engineering (SRE) or Platform Engineering roles.
- Software Engineering: Strong proficiency in Java, Spring Boot, Hibernate, and Jenkins. Ability to read, analyze, and fix application code.
- Containerization: Hands-on expertise with Docker and container orchestration using Kubernetes.
- Infrastructure: Deep knowledge of Linux systems, networking, and distributed architectures.
- Observability: Strong understanding of monitoring, logging, and observability tools (e.g., Prometheus, Grafana, Splunk).
- Education: Bachelor’s degree in Computer Science, Systems Engineering, or equivalent practical experience.
- Soft Skills: Excellent problem-solving abilities and strong communication skills.

Would be a plus:
- Infrastructure as Code & Cloud: Hands-on experience with tools like Terraform or Ansible, alongside familiarity with major public cloud providers (AWS, GCP, or Azure).
- Advanced Networking & Service Mesh: Knowledge of service mesh technologies (e.g., Istio, Linkerd) for traffic management, security, and observability in microservices architectures.
- Industry Experience: Previous background in the FinTech, payments, or banking sectors, with an understanding of high-security compliance standards (e.g., PCI-DSS).

We offer:
- Opportunity to work on bleeding-edge projects
- Work with a highly motivated and dedicated team
- Competitive salary
- Flexible schedule
- Benefits package - medical insurance, sports
- Corporate social events
- Professional development opportunities
- Well-equipped office

About us:

Grid Dynamics (NASDAQ: GDYN) is a leading provider of technology consulting, platform and product engineering, AI, and advanced analytics services. Fusing technical vision with business acumen, we solve the most pressing technical challenges and enable positive business outcomes for enterprise companies undergoing business transformation. A key differentiator for Grid Dynamics is our 8 years of experience and leadership in enterprise AI, supported by profound expertise and ongoing investment in data, analytics, cloud & DevOps, application modernization and customer experience. Founded in 2006, Grid Dynamics is headquartered in Silicon Valley with offices across the Americas, Europe, and India.

Technology

Caspian One

Site Reliability Engineer

Senior

Hybrid

Krakow, Poland

1,400 - 1,800 PLN

🏢 Summary: Hands-on Site Reliability Engineer role focused on ensuring stability, scalability, and observability of a mission-critical distributed risk and analytics platform in hybrid cloud environments. The position centers on production reliability, incident response, automation, and continuous improvement of monitoring and deployment processes. You will collaborate with engineering teams to strengthen system resilience, performance, and operational standards.

🗂️ Requirements:
- Strong Java experience in distributed systems
- Experience with observability and monitoring tools
- Hands-on experience with hybrid cloud environments (preferably GCP)
- Experience with CI/CD pipelines and automation tools
- Solid knowledge of Linux systems administration
- Understanding of RDBMS fundamentals
- Experience with job schedulers (e.g., Control-M)
- Ability to lead incident response and root-cause analysis

📃 Skills: Java, Grafana, Prometheus, Loki, OpenTelemetry, GCP, Jenkins, Ansible, Linux, SQL, Control-M, CI/CD

🏢 Description:

We’re looking for a seasoned Site Reliability Engineer to support a high-performance, mission-critical risk and analytics platform used across global trading and finance environments. You’ll play a key role in ensuring the stability, scalability, and observability of complex distributed systems running across hybrid cloud infrastructure.

In this role, you’ll take ownership of production reliability: driving incident response, conducting root-cause analysis, improving monitoring capabilities, and delivering automation that reduces operational toil. You’ll work closely with development teams, platform engineers, and service management leads to strengthen resilience, refine processes, and enhance the engineering culture around availability and performance.

This is a hands-on technical position suited to someone who thrives in high-throughput environments, communicates clearly, and enjoys solving deep engineering problems in real time.

Core Responsibilities
- Maintain and improve the reliability, uptime, and performance of distributed applications.
- Lead incident response, triage complex issues, coordinate recoveries, and deliver structured post-incident reviews.
- Enhance observability by designing and evolving monitoring, alerting, logging, and tracing frameworks.
- Drive continuous improvement across automation, deployment processes, and service stability.
- Collaborate with cross-functional teams to influence architecture, design, and operational standards.
- Support CI/CD pipelines, environment configuration, and vulnerability remediation.
- Contribute to a knowledge-driven culture through documentation, tooling, and best-practice adoption.

Required Skills & Experience
- Strong Java background with proven experience supporting or developing distributed systems.
- Observability tooling expertise (Grafana, Prometheus, Loki, OpenTelemetry or similar).
- Hands-on with hybrid cloud environments, ideally with GCP or another major cloud provider.
- CI/CD and automation experience (e.g., Jenkins, Ansible).
- Solid understanding of Linux, RDBMS fundamentals, and job schedulers (e.g., Control-M or equivalents).
- Strong analytical mindset with a methodical approach to troubleshooting.
- Excellent communication skills and comfort working in Agile teams.

Employer Direct Healthcare

Employer Direct Healthcare, operating through its Lantern platform, is a healthcare technology company focused on specialty care management. It connects employees of large organizations with a curated Network of Excellence composed of leading specialists in areas such as surgery, cancer care, and infusions. By combining a nationwide specialist network with dedicated care teams and a technology-driven platform, the company aims to improve healthcare quality while reducing costs for employers and their workforces. Serving more than 6 million members and trusted by many of the nation’s largest employers, Lantern emphasizes accessibility, clinical excellence, cost efficiency, and human-centered support. Its culture is guided by core values centered on logic, inclusion, grit, humanity, integrity, and teamwork, reflecting a mission to simplify the healthcare journey and help individuals return to health and everyday life.
