I was fortunate to intern on AWS’s SageMaker / Bedrock ML Infrastructure team during Summer 2025, working with incredible mentors (Jonathan, Javion) and my manager (Qiyun). This was my first big-tech internship, and it gave me a front-row seat to how large-scale ML systems are built and operated in production.

Setting the Stage

SageMaker and Bedrock power large-scale machine learning and generative AI workloads for customers across AWS. My team owned the core infrastructure behind model serving, spanning hardware, software, and developer-facing APIs. With Bedrock evolving rapidly and the org recently restructured, there was significant opportunity to make meaningful, high-impact contributions.

My Contributions

My primary project focused on observability for Bedrock’s model serving infrastructure, a critical gap at the time. With more than 1M SageMaker endpoints powering Bedrock models, diagnosing production issues was slow and opaque; engineers lacked visibility into which dependent AWS services were failing.

To solve this, I:

These tools became core components of the team’s production and on-call workflows.

Seattle + Bellevue Pit Stop

Beyond technical work, I cherished the in-person experience at AWS’s Seattle and Bellevue offices. From exploring downtown and hiking trails to playing poker and everything in between, I soaked in the beauty this opportunity had to offer.

Some of my favorite highlights:

Space Needle
Downtown Seattle
Cute Cat @ Cat Library

What I Learned

This internship taught me how observability, reliability, and operational rigor are just as critical as core functionality in large-scale ML systems. I learned how to design tooling for real production constraints, collaborate within a massive org, and ship systems that meaningfully improve developer velocity and customer experience.