August 1, 2025

Sr Software Dev Engineer, Edge AI ML Platform (Level 6), Edge AI

Senior • On-site

$151,300 - $261,500/yr

Sunnyvale, CA

DESCRIPTION

Are you passionate about building infrastructure that trains the next generation of large language models for edge devices? Join our Edge AI team at Amazon Devices (Lab126) where you'll architect and implement distributed training systems that scale to hundreds of billions of parameters. Your work will enable novel distillation and compression techniques that transform these massive models into efficient versions that run on constrained edge devices.

- Lead the development of our distributed training platform for large language models up to 400B parameters
- Design high-performance training systems that produce models optimized for edge deployment
- Collaborate with ML scientists to create compression pipelines that maintain model quality while reducing size
- Drive innovation in both large-scale training and edge-optimized model deployment

Key job responsibilities
- Architect and implement distributed training systems that efficiently scale across hundreds or thousands of GPUs
- Design and optimize data parallelism, tensor parallelism, and pipeline parallelism strategies for large language models
- Implement memory optimization techniques like activation recomputation, ZeRO, and mixed precision training
- Develop infrastructure that supports novel distillation and compression techniques for edge deployment
- Create evaluation frameworks to measure performance of compressed models on target edge hardware
- Collaborate with ML scientists to optimize training for downstream compression requirements
- Benchmark and profile training configurations to maximize throughput and GPU utilization
- Build pipelines that connect large-scale training to edge model deployment workflows

A day in the life
You'll start your day analyzing performance metrics from overnight training runs, identifying bottlenecks that are limiting throughput on our GPU clusters. After a quick stand-up with the team, you might pair with an ML scientist to implement a new parallelism strategy that reduces memory usage while maintaining computational efficiency.

In the afternoon, you could collaborate with the model compression team to ensure your training infrastructure produces checkpoints optimized for their distillation pipeline. You might debug a communication issue causing training instability across nodes, then optimize a custom CUDA kernel to improve attention computation speed.

Your work bridges the gap between massive-scale model training and efficient edge deployment, enabling AI capabilities that would otherwise be impossible on resource-constrained devices. By optimizing the training infrastructure, you directly impact how quickly we can iterate on new models and compression techniques, accelerating our path to delivering AI features to millions of Amazon devices.

About the team
The Edge AI team at Lab126 is responsible for developing the next generation of AI capabilities for Amazon devices. We're a diverse group of engineers and scientists working at the intersection of machine learning, distributed systems, and hardware optimization. Our mission is to bring powerful AI capabilities to Amazon devices while maintaining privacy, reducing latency, and optimizing for resource constraints.

We tackle the full AI pipeline - from training massive models at scale to compressing and distilling them for efficient edge deployment. This end-to-end approach allows us to optimize each stage of the process specifically for our target devices, achieving capabilities that would be impossible with off-the-shelf solutions.

Our team culture values deep technical expertise combined with practical problem-solving. We embrace challenges that others might consider impossible, and we're not afraid to question conventional approaches when better solutions exist. We work in a collaborative environment where ideas are valued regardless of title, and we take pride in building systems that scale efficiently from research to production.

BASIC QUALIFICATIONS

- 5+ years of non-internship professional software development experience
- 5+ years of experience programming in at least one software programming language
- 5+ years of experience leading the design or architecture (design patterns, reliability, and scaling) of new and existing systems
- Experience as a mentor, tech lead or leading an engineering team
- Experience with distributed systems or high-performance computing
- Proficiency in Python and at least one systems programming language (C++, Rust, etc.)
- Experience with machine learning frameworks such as PyTorch or TensorFlow
- Understanding of GPU programming and optimization techniques

PREFERRED QUALIFICATIONS

- 5+ years of experience with the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
- Bachelor's degree in computer science or equivalent
- Experience scaling distributed training for large language models (30B+ parameters)
- Deep knowledge of PyTorch internals and distributed training modules
- Hands-on experience with parallelism strategies (Data, Tensor, Pipeline, ZeRO)
- Experience with model compression techniques (quantization, distillation, pruning)
- Experience optimizing GPU memory usage and communication patterns
- Knowledge of CUDA programming and custom kernel development
- Background in cloud infrastructure (AWS, Kubernetes) for ML workloads
- Experience with mixed precision training and quantization techniques

Amazon is an equal opportunity employer and does not discriminate on the basis of protected veteran status, disability, or other legally protected status.

Los Angeles County applicants: Job duties for this position include: work safely and cooperatively with other employees, supervisors, and staff; adhere to standards of excellence despite stressful conditions; communicate effectively and respectfully with employees, supervisors, and staff to ensure exceptional customer service; and follow all federal, state, and local laws and Company policies. Criminal history may have a direct, adverse, and negative relationship with some of the material job duties of this position. These include the duties and responsibilities listed above, as well as the abilities to adhere to company policies, exercise sound judgment, effectively manage stress and work safely and respectfully with others, exhibit trustworthiness and professionalism, and safeguard business operations and the Company’s reputation. Pursuant to the Los Angeles County Fair Chance Ordinance, we will consider for employment qualified applicants with arrest and conviction records.

Our inclusive culture empowers Amazonians to deliver the best results for our customers. If you have a disability and need a workplace accommodation or adjustment during the application and hiring process, including support for the interview or onboarding process, please visit https://amazon.jobs/content/en/how-we-hire/accommodations for more information. If the country/region you’re applying in isn’t listed, please contact your Recruiting Partner.

Our compensation reflects the cost of labor across several US geographic markets. The base pay for this position ranges from $151,300/year in our lowest geographic market up to $261,500/year in our highest geographic market. Pay is based on a number of factors including market location and may vary depending on job-related knowledge, skills, and experience. Amazon is a total compensation company. Dependent on the position offered, equity, sign-on payments, and other forms of compensation may be provided as part of a total compensation package, in addition to a full range of medical, financial, and/or other benefits. For more information, please visit https://www.aboutamazon.com/workplace/employee-benefits. This position will remain posted until filled. Applicants should apply via our internal or external career site.