Principal Software Site Reliability Engineer - Problem Management & RCA
Posted on: June 12, 2021
At AT&T, we're connecting the world through the latest tech,
top-of-the-line communications and the best in entertainment. Our
groundbreaking digital solutions provide an intuitive and
integrated experience for customers across online, retail and care
channels. Join our mission to deliver compelling communication and
entertainment experiences to customers around the world. You'll
drive how we deliver a seamless and fast customer experience with
digital at the center of AT&T's distribution channels. We're
offering an opportunity to revolutionize the digital space and the
chance to create a career that will propel your future.
Principal Software/Site Reliability Engineer - Problem Mgmt & RCA
This position is responsible for driving 24x7 Problem/Incident
Mgmt impact and RCA assessment and communication for Consumer
online Sales, Account Management, and Support websites and mobile
apps. This position will define Service Level Objectives (SLOs),
track and drive availability and service metrics, and ensure that
operational SLOs are met.
- Analysis of GTOC enterprise Incidents including implementing
automated tracking and reporting of system, customer & business
impacts from site outages, incidents, and critical defects.
- Weekly and monthly analysis of progress & accomplishment
against Service Level Objectives (SLOs) and identifying/driving gap
closures where necessary.
- Coordinating with GTOC, Digital Product Delivery (PO/PM, Dev,
QA), Operations, Site Reliability Engineers, Infrastructure/Network
& 3rd Party vendors to drive resolution of reported problems.
- Leading Root-Cause Analysis (RCA) for complex outages,
incidents, and critical/major defects, and tracking resolution
- Providing training to teams and auditing RCAs to ensure blameless
post-mortems are conducted per established principles and the
resulting information is actionable, so that the same problems do
not occur more than once.
- Developing tools, scripts, queries and performing data analysis
of weekly/monthly/YTD incidents/problems to determine
chronic/recurring root causers and applications with high frequency
- Partnering with Site Reliability Engineers (SREs), DevOps
teams, Network, Infrastructure, Security & Fraud services to
establish proactive and automated monitoring/alerting for chronic
root causers, establish get-well/improvement plans, and drive
established improvement plans through to resolution.
- 8+ years related experience with a bachelor's degree in
Computer Science, Information Systems or related field.
- 6+ years of progressive experience in one or more of the
following areas: application delivery; subject matter expertise in
building Java-based high-volume/high-transaction e-commerce
- 6+ years of experience building web applications using
- 3+ years of experience working with front end frameworks such
as React or Angular
- 4+ years of experience in architecture and design of systems
using microservices architecture
- 4+ years of experience in a leadership capacity - coaching and
mentoring engineers and developers
- 2+ years of experience working with SPA/PWA architectures
- 2+ years of experience with server-side rendering technologies
- 2+ years of experience in cloud technologies: AWS, Azure,
OpenStack, Docker, Kubernetes, Ansible, Chef or Terraform
- 2+ years of experience in build and CI/CD technologies: GitHub,
Maven, Jenkins, Nexus or Sonar
- 4+ years of experience in unit and functional testing using
JUnit, Spock, Mockito/JMock, Selenium, Cucumber, SoapUI or
- Proficiency in Unix/Linux command line
- Expert knowledge and experience working with asynchronous
message processing, stream processing and event driven
- Experience working within Agile/Scrum/Kanban development
- Excellent written and verbal communication skills with
demonstrated ability to present complex technical information in a
clear manner to peers, developers, and senior leaders
services, NoSQL technologies (Cassandra/MongoDB), Kafka/MQ/Rabbit,
Redis/Hazelcast, Git, Jira, Jenkins, Docker, Kubernetes
AT&T is leading the way to the future - for customers,
businesses and the industry. We're developing new technologies to
make it easier for our customers to stay connected to their world.
Together, we've built a premier integrated communications and
entertainment company and an amazing place to work and grow. Team
up with industry innovators every time you walk into work, creating
the world you always imagined. Ready to #transformdigital with us?
Job ID 2040189 Date posted 05/16/2021
Keywords: AT&T, Dallas, Principal Software Site Reliability Engineer - Problem Management & RCA, Other, Dallas, Texas