Senior Site Reliability Engineer
1 week ago
Ahold Delhaize USA, a division of global food retailer Ahold Delhaize, is part of the U.S. family of brands, which includes five leading omnichannel grocery brands – Food Lion, Giant Food, The GIANT Company, Hannaford and Stop & Shop. Our associates support the brands with a wide range of services, including Finance, Legal, Sustainability, Commercial, Digital and E-commerce, Technology and more.
Primary Purpose
The Site Reliability Engineer (SRE) III is responsible for ensuring the scalability, reliability, and performance of production systems through automation, observability, incident response, and infrastructure engineering. This role involves designing and implementing robust operational processes and tooling to support highly available, fault-tolerant systems in a cloud-native environment. The SRE III collaborates closely with engineering squads, product teams, and stakeholders to embed reliability best practices across the software delivery lifecycle. The role includes ownership of system uptime, service level objectives (SLOs), and operational excellence, along with mentoring junior engineers and leading cross-functional initiatives that improve system resilience.
Our flexible/hybrid work schedule includes 3 in-person days at our Chicago office and 2 remote days.
Applicants must be currently authorized to work in the United States on a full-time basis.
Duties & Responsibilities
- Design and implement infrastructure solutions that ensure system availability, scalability, and reliability across cloud-native environments like AKS and Kubernetes.
- Develop automation for provisioning, deployment, configuration, monitoring, and incident remediation using tools such as Terraform, ArgoCD, and GitHub Actions.
- Collaborate with engineering teams to define and track service level objectives (SLOs) and service level indicators (SLIs).
- Build and manage microservices-based platforms leveraging Spring Boot, Java, Tomcat, and Redis.
- Monitor production environments using Datadog and proactively address performance and reliability issues.
- Perform root cause analysis and lead post-incident reviews to drive continual improvement.
- Manage CI/CD pipelines and deployment automation using GitHub, Docker, and container orchestration technologies.
- Create and maintain infrastructure as code (IaC) using Terraform, with deployment pipelines integrated into GitOps workflows.
- Lead and support operational readiness reviews, game days, chaos engineering practices, and failure mode analysis.
- Build scalable observability and alerting frameworks with Datadog.
- Implement resilient, asynchronous architectures using Kafka for event-driven services.
- Reduce operational toil through self-healing automation and proactive system tuning.
- Troubleshoot Linux-based environments such as Ubuntu and optimize them for reliability.
- Provide on-call support and ensure 24/7/365 system reliability for mission-critical applications.
- Collaborate with the security team to enforce secure operational practices and cloud compliance.
- Mentor junior engineers and contribute to documentation, technical design, and knowledge-sharing across the organization.
Qualifications
- Bachelor's Degree in Computer Science, Information Systems, or a related technical field; equivalent training, certifications, or experience will be considered.
- 5+ years of experience in a Site Reliability Engineering, DevOps, or Java programming role.
- Experience managing production-grade systems and services on AKS/Kubernetes in distributed environments.
- Proficiency in programming and scripting languages including Python, Java, Bash, or Go.
- Proven experience with Spring Boot, Tomcat, Redis, and microservices architecture.
- Hands-on experience in managing Linux environments, particularly Ubuntu.
- Proficiency with observability stacks and performance monitoring using Datadog, Prometheus, and ELK.
- Deep understanding of containerization and orchestration using Docker, Kubernetes, and ArgoCD.
- Experience managing event-driven systems using Kafka.
- Expertise in IaC and automation using Terraform and GitHub Actions.
- Familiarity with networking concepts, DNS, load balancing, and cloud infrastructure (AWS, Azure, or GCP).
- Strong analytical, debugging, and problem-solving skills.
- Excellent verbal and written communication skills and the ability to collaborate effectively across teams.
Salary Range: $125,040 - $187,560
Actual compensation offered to a candidate may vary based on their unique qualifications and experience, internal equity, and market conditions. Final compensation decisions will be made in accordance with company policies and applicable laws.
At Ahold Delhaize USA, we provide services to one of the largest portfolios of grocery companies in the nation, and we're actively seeking top talent. Our team shares a common motivation to drive change, take ownership and enable our brands to better care for their customers. We thrive on supporting great local grocery brands and their strategies.
We offer an experience where our associates are valued; Diversity, Equity, Inclusion and Belonging are infused in our business and our employees are representative of the communities that we serve. We believe in total wellness, which encompasses a blend of physical, financial and emotional wellness.
We believe in collaboration, curiosity, and continuous learning in all that we think, create and do. While building a culture where personal and professional growth are just as important as business growth, we invest in our people, empowering them to learn, grow and deliver at all levels of the business.
-
Site Reliability Engineer
1 week ago
Italy, Reply, Full-time, €40,000 - €80,000 per year
Is the Cloud world your passion? Would you like to become an expert in Cloud Computing, DevOps and Automation within a team that tackles new challenges every day? At Cloud9, a startup of the Reply group, we are looking for a Site Reliability Engineer to support our clients in managing and evolving Hybrid & Multicloud architectures of...
-
Senior Site Reliability Engineer
3 weeks ago
Italy, Remotely, Full-time
Location: LATAM, Europe. CloudDevs works with fast-moving, venture-backed startups across the US. We're building a pool of world-class Site Reliability Engineers for current roles and for upcoming opportunities. You will either be placed directly into one of our partner startups or added to our vetted SRE network for future projects. This role is ideal for...
-
Principal Site Reliability Engineer
1 week ago
Italy, Ahold Delhaize, Full-time, €146,960 - €220,440 per year
Ahold Delhaize USA, a division of global food retailer Ahold Delhaize, is part of the U.S. family of brands, which includes five leading omnichannel grocery brands – Food Lion, Giant Food, The GIANT Company, Hannaford and Stop & Shop. Our associates support the brands with a wide range of services, including Finance, Legal, Sustainability, Commercial,...
-
Site Reliability Engineer
3 weeks ago
Italy, Immobiliare.it, Full-time
Immobiliare.it S.p.A. is an Italian group of companies specializing in Digital Tech services for the sale and rental of real estate, aimed at private individuals, real estate professionals, banks and financial-sector operators. Founded in 2005, Immobiliare.it, the No. 1 real estate portal in Italy, has expanded its offering...
-
Senior SRE: Scale Reliability
3 weeks ago
Italy, Remotely, Full-time
A global startup recruitment firm is seeking experienced Site Reliability Engineers to enhance system reliability and performance. Ideal candidates have 5+ years in SRE roles and expertise in cloud infrastructure and observability tools. This position offers the chance to work across various startups and influence the reliability standards of the tech...
-
Site reliability engineer
2 weeks ago
Southern Italy, Immobiliare.it, Full-time
Immobiliare.it S.p.A. is an Italian group of companies specializing in Digital Tech services for the sale and rental of real estate, aimed at private individuals, real estate professionals, banks and financial-sector operators. Founded in 2005, Immobiliare.it, the No. 1 real estate portal in Italy, has expanded its offering with...
-
Site Reliability Engineer
20 hours ago
Southern Italy, Immobiliare.it, Full-time
Immobiliare.it S.p.A. is an Italian group of companies specializing in Digital Tech services for the sale and rental of real estate, aimed at private individuals, real estate professionals, banks and financial-sector operators. Founded in 2005, Immobiliare.it, the No. 1 real estate portal in Italy, has expanded its offering with...
-
Site reliability engineer
2 weeks ago
Italy, Immobiliare.it, Full-time
Immobiliare.it S.p.A. is an Italian group of companies specializing in Digital Tech services for the sale and rental of real estate, aimed at private individuals, real estate professionals, banks and financial-sector operators. (…) Immobiliare.it Insights, the company's proptech arm, offers digital services for advisory, insights and...
-
Principal Site Reliability Engineer
3 weeks ago
Italy, SaaS Industry, Full-time
Principal Site Reliability Engineer - Azure Red Hat OpenShift in Madrid or Remote. The Red Hat Site Reliability Engineering (SRE) team is looking for a Principal Site Reliability Engineer to join us. In this role, you will develop, scale, and operate our OpenShift managed cloud services. OpenShift is Red Hat's enterprise Kubernetes distribution. As an SRE...