Monday, June 20
09:00 AM - 09:30 AM
Live in San Francisco
The deep learning models driving innovation in autonomous vehicles are becoming more ambitious by the day, but their supporting infrastructure often struggles to keep up. Because a single GPU can’t accommodate the complex neural networks of enterprise AV projects, distributed training has emerged as the solution for training DL models on large data sets. In distributed training, available storage, compute power, and effective batch size grow with each GPU added to the cluster, dramatically reducing training time.
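The core idea behind the data-parallel flavor of distributed training can be shown in a few lines: each GPU (worker) computes gradients on its own shard of the batch, and the averaged gradients equal the gradient of the full batch, so adding workers scales the effective batch size without changing the math. The minimal sketch below simulates this in plain Python with a 1-D linear model; the function names and data are illustrative, not taken from the talk or from Run:ai's platform.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean squared error for a 1-D linear model y = w * x."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def data_parallel_grad(w, xs, ys, workers):
    """Shard the batch across `workers`, compute local gradients, average.

    The averaging step plays the role of the all-reduce that a real
    distributed framework performs across GPUs after each backward pass.
    """
    shard = len(xs) // workers
    grads = [
        grad_mse(w, xs[i * shard:(i + 1) * shard],
                    ys[i * shard:(i + 1) * shard])
        for i in range(workers)
    ]
    return sum(grads) / workers

# Toy batch of four samples, split across two simulated workers.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

full = grad_mse(w, xs, ys)                            # single-GPU gradient
sharded = data_parallel_grad(w, xs, ys, workers=2)    # data-parallel gradient
print(abs(full - sharded) < 1e-12)  # True: the two are mathematically identical
```

With equal shard sizes the average of per-shard mean gradients equals the full-batch mean gradient, which is why synchronous data-parallel training converges like single-device training with a larger batch.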
In this talk, we address a lane detection use case where Run:ai, Microsoft, and NetApp jointly built a distributed training DL solution at scale that runs in the Azure cloud. This solution enables data scientists to fully embrace the Azure cloud scaling capabilities and cost benefits for automotive use cases.
You’ll learn how Run:ai’s cloud-native compute orchestration platform, Atlas, helps enterprises dramatically reduce the time to train and productize AI models by creating a virtual pool of compute resources and automating their allocation. With dynamic, workload-aware scheduling, IT can achieve previously unattainable levels of GPU utilization and ensure business goals are met with custom prioritization rules and dashboards, while data scientists can start experiments and run hundreds of training jobs without ever touching code.
Since its inception in 2018, Run:ai has continued to break through the known limits of GPU technology, releasing multiple new capabilities in rapid succession, such as fractional GPU allocation, thin GPU provisioning, job swapping, and dynamic scheduling for NVIDIA’s Multi-Instance GPU (MIG) technology. It is the only AI infrastructure solution boasting near-100% GPU utilization for its enterprise customers. As cited in The Forrester Wave: AI Infrastructure, Q4 2021, Run:ai offers enterprises “complete flexibility in the hardware they choose to use and where they choose to run it.”