
Principal Software Site Reliability Engineer - Problem Management & RCA

Company: AT&T
Location: Dallas
Posted on: June 12, 2021

Job Description:

At AT&T, we're connecting the world through the latest tech, top-of-the-line communications and the best in entertainment. Our groundbreaking digital solutions provide an intuitive and integrated experience for customers across online, retail and care channels. Join our mission to deliver compelling communication and entertainment experiences to customers around the world. You'll drive how we deliver a seamless and fast customer experience with digital at the center of AT&T's distribution channels. We're offering an opportunity to revolutionize the digital space and the chance to create a career that will propel your future.

Principal Software/Site Reliability Engineer - Problem Mgmt & RCA

Position Overview

This position is responsible for driving 24x7 Problem/Incident Management impact assessment, RCA, and communication for Consumer online Sales, Account Management, and Support websites and mobile apps. It will also define Service Level Objectives (SLOs), track and drive availability and service metrics, and drive the accomplishment of operational SLOs.


  • Analysis of GTOC enterprise Incidents including implementing automated tracking and reporting of system, customer & business impacts from site outages, incidents, and critical defects.
  • Weekly and monthly analysis of progress & accomplishment against Service Level Objectives (SLOs) and identifying/driving gap closures where necessary.
  • Coordinating with GTOC, Digital Product Delivery (PO/PM, Dev, QA), Operations, Site Reliability Engineers, Infrastructure/Network & 3rd Party vendors to drive resolution of reported problems.
  • Leading Root-Cause Analysis (RCA) for complex outages, incidents, and critical/major defects, and tracking resolution through completion.
  • Providing training to teams and auditing RCAs to ensure blameless post-mortems are conducted per established principles and that the resulting information is actionable, so the same problems do not occur more than once.
  • Developing tools, scripts, and queries and performing data analysis of weekly/monthly/YTD incidents/problems to determine chronic/recurring root causers and applications with a high frequency of incidents.
  • Partnering with Site Reliability Engineers (SREs), DevOps teams, Network, Infrastructure, Security & Fraud services to establish proactive and automated monitoring/alerting for chronic root causers, establish get-well/improvement plans, and drive those plans through to resolution.

Minimum Qualifications

  • 8+ years related experience with a bachelor's degree in Computer Science, Information Systems or related field.
  • 6+ years of progressive experience in one or more of the following areas: application delivery; subject matter expertise in building Java-based high-volume/high-transaction e-commerce applications
  • 6+ years of experience building web applications using HTML5/CSS3/JavaScript
  • 3+ years of experience working with front end frameworks such as React, Angular

Preferred Qualifications

  • 4+ years of experience in architecture and design of systems using microservices architecture
  • 4+ years of experience in a leadership capacity - coaching and mentoring engineers, developers
  • 2+ years of experience working with SPA/PWA architectures
  • 2+ years of experience with server-side rendering technologies and architectures
  • 2+ years of experience in cloud technologies: AWS, Azure, OpenStack, Docker, Kubernetes, Ansible, Chef or Terraform
  • 2+ years of experience in build and CI/CD technologies: GitHub, Maven, Jenkins, Nexus or Sonar
  • 4+ years of experience in unit and functional testing using JUnit, Spock, Mockito/JMock, Selenium, Cucumber, SoapUI or Postman
  • Proficiency in Unix/Linux command line
  • Expert knowledge and experience working with asynchronous message processing, stream processing and event driven computing.
  • Experience working within an Agile/Scrum/Kanban development team
  • Excellent written and verbal communication skills with demonstrated ability to present complex technical information in a clear manner to peers, developers, and senior leaders

Technical Skills

HTML5, CSS3, JavaScript, React, Next.js, Angular, Node.js, REST services, NoSQL technologies (Cassandra/MongoDB), Kafka/MQ/RabbitMQ, Redis/Hazelcast, Git, Jira, Jenkins, Docker, Kubernetes

AT&T is leading the way to the future - for customers, businesses and the industry. We're developing new technologies to make it easier for our customers to stay connected to their world. Together, we've built a premier integrated communications and entertainment company and an amazing place to work and grow. Team up with industry innovators every time you walk into work, creating the world you always imagined. Ready to #transformdigital with us? Apply now!

Job ID 2040189 Date posted 05/16/2021

Keywords: AT&T, Dallas, Principal Software Site Reliability Engineer - Problem Management & RCA, Other, Dallas, Texas
