Site Reliability Engineer (SRE) – Public Cloud, AVP-C12 (Hybrid)

September 18, 2023

Job Description

About Citi:

Citi, the leading global bank, has approximately 200 million customer accounts and does business in more than 160 countries and jurisdictions. Citi provides consumers, corporations, governments, and institutions with a broad range of financial products and services, including consumer banking and credit, corporate and investment banking, securities brokerage, transaction services, and wealth management.


As a bank with a brain and a soul, Citi creates economic value that is systemically responsible and in our clients’ best interests. As a financial institution that touches every region of the world and every sector that shapes your daily life, our Enterprise Operations & Technology teams are charged with a mission that rivals any large tech company. Our technology solutions are the foundation of everything we do, from keeping the bank safe, managing global resources, and providing the technical tools our workers need to succeed, to designing our digital architecture and ensuring our platforms provide a first-class customer experience. We reimagine client and partner experiences to deliver excellence through secure, reliable, and efficient services.


Our commitment to diversity includes a workforce that represents the clients we serve from all walks of life, backgrounds, and origins. We foster an environment where the best people want to work. We value and demand respect for others, promote individuals based on merit, and ensure opportunities for personal development are widely available to all. Ideal candidates are innovators with well-rounded backgrounds who bring their authentic selves to work and complement our culture of delivering results with pride. If you are a problem solver who seeks passion in your work, come join us. We’ll enable growth and progress together.

The Role:

At Citi, we know how important reliability is for our customers. Our Site Reliability Engineers (SREs) bring drive and determination to ensure our customers get the best possible experience interacting with our technology services. As a Site Reliability Engineer (SRE) in our Public Cloud group, you will work on complex technical problems, solving for scale, performance, and availability. The ideal candidate has experience gained in a software development environment and a deep appreciation of best practices for the design and deployment of fault-tolerant solutions for cloud platforms.



Responsibilities:

  • Engage with systems engineering and application development teams at all stages of the technology life cycle
  • Express opinions and ideas related to reliability, fault tolerance and operational toil
  • Devise innovative ideas for solving difficult technical problems involving distributed systems, scale, and security, and translate these ideas into designs and implementations
  • Implement best practices around availability, scalability, operational excellence, and efficiency, using data-driven analysis techniques where appropriate
  • Identify, triage, and automate systems
  • Help evolve systems by pushing for change that improves reliability and developer velocity
  • Help develop robust organizational practices around monitoring, alerting, testing, deployment, and incident response
  • Help identify key uptime and performance metrics for production systems and implement metrics-based practices and processes
  • Suggest methods and new technologies for making changes more effective and for improving general production support


Qualifications:

  • 3-5 years’ relevant hands-on experience per the qualifications below
  • At least 3 years’ experience with Public Cloud Containerization Technology – ECS or EKS
  • At least 3 years’ experience with Network Load Balancing technologies
  • Hands-on experience developing and engineering software in Java, Python, C++, or Ruby
  • Experience with modern SDLC tools and the ability to develop and enforce CI/CD practices
  • Proficiency with monitoring and observability technologies like Prometheus and Grafana
  • Familiarity with Domain-Driven Design and Event-Driven Architecture
  • Experience working in a distributed, cloud-based environment using Azure/AWS/GCP (Docker/Kubernetes)
  • Experience with Service-Oriented Architecture applications and cloud-based services, preferably AWS
  • Experience working closely with or as part of a Technology Operations team, with a firm understanding of how to meet the demands and overcome the challenges of that domain
  • Experience with any of the following would be a big plus: TDD & automated UI testing frameworks, any design frameworks, mobile web development, in-depth Linux troubleshooting


Education:

  • Bachelor’s Degree or equivalent relevant experience
  • Certifications in Cloud Security (AWS, GCP, etc.) and/or OpenStack Administrator are a big plus

This job description provides a high-level review of the types of work performed. Other job-related duties may be assigned as required.


Job Family Group:



Job Family:

Systems & Engineering


Time Type:

Full time


Primary Location:

Irving, Texas, United States


Primary Location Salary Range:

$89,620.00 – $134,430.00


Citi is an equal opportunity and affirmative action employer.

Qualified applicants will receive consideration without regard to their race, color, religion, sex, sexual orientation, gender identity, national origin, disability, or status as a protected veteran.

Citigroup Inc. and its subsidiaries (“Citi”) invite all qualified interested applicants to apply for career opportunities. If you are a person with a disability and need a reasonable accommodation to use our search tools and/or apply for a career opportunity, review Accessibility at Citi.

View the “EEO is the Law” poster. View the EEO is the Law Supplement.

View the EEO Policy Statement.

View the Pay Transparency Posting.