AWS - Pranav

I was fortunate to intern on AWS’s SageMaker / Bedrock ML Infrastructure team during Summer 2025, working with incredible mentors (Jonathan, Javion) and my manager (Qiyun). This was my first big-tech internship, and it gave me a front-row seat to how large-scale ML systems are built and operated in production.

Setting the Stage

SageMaker and Bedrock power large-scale machine learning and generative AI workloads for customers across AWS. My team owned the core infrastructure behind model serving, spanning hardware, software, and developer-facing APIs. With Bedrock evolving rapidly and the org recently restructured, there was significant opportunity to make meaningful, high-impact contributions.

My Contributions

My primary project focused on observability for Bedrock’s model serving infrastructure, a critical gap at the time. With over 1M SageMaker endpoints powering Bedrock models, diagnosing production issues was slow and opaque; engineers lacked visibility into which dependent AWS services were failing.

To solve this, I:

Built a Go-based CLI tool that mapped endpoints to their downstream AWS service dependencies, giving on-call engineers a clear service topology for faster root-cause analysis.
Designed a real-time system health monitoring framework, enabling live visibility into infrastructure health and reducing on-call response times by ~70% through proactive issue detection.
Containerized audit infrastructure using AWS Copilot, modernizing deployments and improving scalability to reliably handle 10K+ audit queries per week.

These tools became core components of the team’s production and on-call workflows.
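
To give a rough feel for the endpoint-to-dependency mapping idea, here is a minimal Go sketch. It is not the internal tool: the `Topology` type, the `UnhealthyDeps` function, the endpoint names, and the health data are all hypothetical, and the real tool presumably discovered dependencies and health from live AWS data rather than hard-coded maps.

```go
package main

import (
	"fmt"
	"sort"
)

// Topology maps a (hypothetical) endpoint name to the downstream
// services it depends on.
type Topology map[string][]string

// UnhealthyDeps returns, in sorted order, the dependencies of endpoint
// that are not marked healthy. Services missing from the healthy map are
// treated as unhealthy, which is an assumption of this sketch.
func UnhealthyDeps(t Topology, healthy map[string]bool, endpoint string) []string {
	var failing []string
	for _, svc := range t[endpoint] {
		if !healthy[svc] {
			failing = append(failing, svc)
		}
	}
	sort.Strings(failing)
	return failing
}

func main() {
	// Illustrative data only; not real endpoints or health states.
	topo := Topology{
		"endpoint-a": {"s3", "dynamodb", "kms"},
		"endpoint-b": {"s3", "sqs"},
	}
	healthy := map[string]bool{
		"s3":       true,
		"dynamodb": false,
		"kms":      true,
		"sqs":      true,
	}

	// Print the failing dependencies for each endpoint.
	for _, ep := range []string{"endpoint-a", "endpoint-b"} {
		fmt.Printf("%s: failing dependencies = %v\n", ep, UnhealthyDeps(topo, healthy, ep))
	}
}
```

The point of keeping the topology as a plain lookup is that an on-call engineer can go from "this endpoint is unhealthy" to "these specific dependencies are the suspects" in one step, instead of checking each service by hand.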

Seattle + Bellevue Pit Stop

Beyond technical work, I cherished the in-person experience at AWS’s Seattle and Bellevue offices. From exploring downtown and hiking trails to playing poker and everything in between, I soaked in the beauty this opportunity had to offer.

Some of my favorite highlights:

Hiking Lake Serene and exploring its hidden charm.
Late-night poker games with fellow interns.
Pickleball + Boba runs after work.

What I Learned

This internship taught me how observability, reliability, and operational rigor are just as critical as core functionality in large-scale ML systems. I learned how to design tooling for real production constraints, collaborate within a massive org, and ship systems that meaningfully improve developer velocity and customer experience.

Pranav Varshney
