I was fortunate to intern on AWS’s SageMaker / Bedrock ML Infrastructure team during Summer 2025, working with incredible mentors (Jonathan, Javion) and my manager (Qiyun). This was my first big-tech internship, and it gave me a front-row seat to how large-scale ML systems are built and operated in production.
Setting the Stage
SageMaker and Bedrock power large-scale machine learning and generative AI workloads for customers across AWS. My team owned the core infrastructure behind model serving, spanning hardware, software, and developer-facing APIs. With Bedrock rapidly evolving and recent org restructuring, there was significant opportunity to make meaningful, high-impact contributions.
My Contributions
My primary project focused on observability for Bedrock’s model serving infrastructure, a critical gap at the time. With 1M+ SageMaker endpoints powering Bedrock models, diagnosing production issues was slow and opaque; engineers lacked visibility into which dependent AWS services were failing.
To solve this, I:
- Built a Go-based CLI tool that mapped endpoints to their downstream AWS service dependencies, giving on-call engineers a clear service topology for faster root-cause analysis.
- Designed a real-time system health monitoring framework, enabling live visibility into infrastructure health and reducing on-call response times by ~70% through proactive issue detection.
- Containerized audit infrastructure using AWS Copilot, modernizing deployments and improving scalability to reliably handle 10K+ audit queries per week.
These tools became core components of the team’s production and on-call workflows.
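To make the endpoint-to-dependency mapping concrete, here is a minimal sketch in Go. It is not the internal tool: the `Dependency` and `Topology` types, the method names, and the sample endpoint and service names are all hypothetical, and the health flags are hard-coded rather than fetched from any AWS API. It only illustrates the idea of listing an endpoint's downstream services and surfacing the failing ones.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Dependency is a hypothetical record of one downstream service and its health.
type Dependency struct {
	Service string
	Healthy bool
}

// Topology maps an endpoint name to its downstream dependencies (illustrative only).
type Topology map[string][]Dependency

// FailingDependencies returns the sorted names of unhealthy services for an endpoint.
func (t Topology) FailingDependencies(endpoint string) []string {
	var failing []string
	for _, dep := range t[endpoint] {
		if !dep.Healthy {
			failing = append(failing, dep.Service)
		}
	}
	sort.Strings(failing)
	return failing
}

// Render returns a plain-text view of an endpoint and its dependencies,
// in the order they were registered.
func (t Topology) Render(endpoint string) string {
	var b strings.Builder
	fmt.Fprintf(&b, "%s\n", endpoint)
	for _, dep := range t[endpoint] {
		status := "ok"
		if !dep.Healthy {
			status = "FAILING"
		}
		fmt.Fprintf(&b, "  -> %s [%s]\n", dep.Service, status)
	}
	return b.String()
}

func main() {
	// Sample data is made up for the sketch; a real tool would discover it.
	topo := Topology{
		"example-endpoint": {
			{Service: "s3", Healthy: true},
			{Service: "kms", Healthy: false},
			{Service: "ecr", Healthy: true},
		},
	}
	fmt.Print(topo.Render("example-endpoint"))
	fmt.Println("failing:", topo.FailingDependencies("example-endpoint"))
}
```

Keeping the topology as a plain map means the same data can back both a human-readable view for on-call engineers and a programmatic check for failing services.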
Seattle + Bellevue Pit Stop
Beyond technical work, I cherished the in-person experience at AWS’s Seattle and Bellevue offices. From exploring downtown and hiking trails to playing poker and everything in between, I soaked in all the beauty this opportunity had to offer.
Some of my favorite highlights:
- Hiking Lake Serene and exploring its hidden charm.
- Late-night poker games with fellow interns.
- Pickleball + boba runs after work.
What I Learned
This internship taught me how observability, reliability, and operational rigor are just as critical as core functionality in large-scale ML systems. I learned how to design tooling for real production constraints, collaborate within a massive org, and ship systems that meaningfully improve developer velocity and customer experience.