About the Role
Design and develop AI-driven solutions, particularly using generative AI and large models, for automated incident triage, root cause analysis, and resolution recommendations within Prime Video's observability systems.
Requirements
Requires expertise in designing and developing machine learning and generative AI systems for automated incident triage, root cause analysis, and resolution recommendations, with experience in large models and evaluation frameworks.
Full Job Description
Come build the future of entertainment with us. Are you interested in shaping the future of movies and television? Do you want to define the next generation of how and what Amazon customers are watching?
Prime Video is a premium streaming service that offers customers a vast collection of TV shows and movies - all with the ease of finding what they love to watch in one place. We offer customers thousands of popular movies and TV shows, from Amazon Originals and exclusive licensed content to exciting live sports events. We also offer our members the opportunity to subscribe to add-on channels, which they can cancel at any time, and to rent or buy new release movies and TV box sets on the Prime Video Store. Prime Video is a fast-paced, growth business - available in over 200 countries and territories worldwide. The team works in a dynamic environment where innovating on behalf of our customers is at the heart of everything we do. If this sounds exciting to you, please read on.
The Observability and Triage team is looking for an Applied Scientist, experienced in generative AI and large models, for our London office. This is a wide-impact role working with development teams across the UK, India, and the US. This greenfield project will deliver features that reduce the operational load for internal Prime Video builders. To achieve this, you will develop AI-driven solutions that automatically detect anomalies, identify root causes, recommend resolution paths, and take action on operational incidents. We consume petabytes of data daily across multiple metric, log, and data-based events, and you will experiment with how to shape the future through this data.
You will have strong technical ability, excellent teamwork and communication skills, and a strong motivation to deliver customer value from your research. Our position offers opportunities to grow your technical and non-technical skills and make a global impact.
Key job responsibilities
- Design and develop machine learning and generative AI systems for automated incident triage, root cause analysis, and resolution recommendation at scale
- Rapidly prototype and evaluate hypotheses in a high-ambiguity environment, leveraging both quantitative experimentation and domain expertise in operational systems
- Build evaluation frameworks (including LLM-as-a-Judge approaches) to measure model performance across triage accuracy and root cause prediction
- Collaborate with software engineering teams to integrate ML models into production observability systems serving hundreds of development teams
- Communicate results and insights to both technical and non-technical audiences, including through publications, presentations, and written reports
A day in the life
On a typical day, you analyse patterns across thousands of operational incidents to improve an automated triage model, then design an experiment to test whether a new generative AI-based approach better identifies root causes for complex multi-service incidents. Your internal customers are Prime Video development teams who rely on your solutions to reduce the time and effort spent responding to operational events. You will collaborate closely with software engineers and operational stakeholders across the world to ensure your research translates into production systems that measurably reduce customer impact.
About the team
Our team builds AI-powered observability and triage solutions for Prime Video development teams, consuming petabytes of data daily to automatically detect, diagnose, and recommend resolutions for operational incidents.