Meet Coupang’s Machine Learning Platform (2024)

How Coupang’s ML Platform accelerates ML development for Coupang products

Coupang Engineering


By Hyun Jung Baek, Hara Ketha, Jaideep Ray, Justina Min, Mohamed Sabbah, Ronak Panchal, Seshu Adunuthula, Thimma Reddy Kalva, and Enhua Tan

This post is also available in Korean.

Coupang is reimagining the shopping and delivery experience to wow customers from the instant they open the Coupang app to the moment an order arrives at their door. Beyond e-commerce, Coupang offers a range of consumer services, including Coupang Eats for food delivery, Coupang Play for video streaming, Coupang Pay for payments, and Coupang Grocery for fresh products, amongst others.

Machine Learning (ML) impacts every aspect of the Coupang customer experience: the product catalog, search, pricing, robotics, inventory, and fulfillment. As Coupang ventures into new markets, ML plays an even more important role.

ML helps power search and discovery across Coupang websites and apps, price products and services, streamline logistics and delivery, optimize content for streaming, rank ads, and much more.

Therefore, we strive to scale machine learning development at all ML lifecycle stages, including ad-hoc exploration, training data preparation, model development, and robust production deployment of models.

Table of Contents

· ML @ Coupang
· Motivation
  1. Reduce time to production
  2. Incorporate CI/CD in ML development
  3. Scale ML compute efficiently
· Core offerings of Coupang ML Platform
  1. Notebooks & ML Pipeline Authoring
  2. Feature Engineering
  3. Model Training
  4. Model Inference
  5. Monitoring & Observability
  6. Training & Inference Clusters
· Success Stories
  1. Training Ko-BERT to understand search queries better
  2. Real-time price forecasting of products

ML teams at Coupang are actively developing models in Natural Language Processing (NLP), Computer Vision (CV), Recommendations, and Forecasting. NLP is used to understand search queries, product listings, and ads content. Computer vision-enabled image understanding categorizes similar products and ads. Recommendation models rank content for product search, videos in Coupang Play, and product ads. Forecasting techniques help us understand supply, demand, and pricing for millions of products.

This post introduces Coupang’s internal ML platform and describes how the platform supports the increasing scale and diversity of workloads across ML frameworks, programming languages, different model architectures, and training & serving paradigms.

The motivation behind Coupang ML Platform is to provide ‘batteries-included’ services to accelerate ML development through improved developer productivity.

Core services include managed notebooks (Jupyter), a pipeline SDK, a feature store, model training, and model inference. ML teams can use these services independently to compose their ML pipelines. Our focus areas are as follows:

1. Reduce time to production

Before the Coupang ML Platform, authoring and training an ML model required hours of non-trivial setup work and boilerplate code to prepare data and features and to write trainer code. Tasks like scaling model training across GPUs with distributed training took deep engineering work and led to duplicated stacks across teams.
Deploying an ML model to serve real-time traffic took weeks of effort, with each team replicating logic for model benchmarking, auto-scaling, security, and rollback. These were blockers for product groups adopting ML at a larger scale. By leveraging the ML Platform's lifecycle services, teams can train, debug, and deploy simple to complex models to production within days, in a standardized way.

2. Incorporate CI/CD in ML development

ML development can quickly incur heavy technical debt. To make it easier for ML teams to build, deploy, and maintain models, we provide integration-tested, prepackaged containers with popular ML libraries.
Moreover, we provide libraries to validate models, add canary stages to model deployment, and monitor primary metrics during serving.

3. Scale ML compute efficiently

There is surging demand for compute in Coupang — GPUs for deep learning training, storage for large datasets, and network bandwidth for distributed training. Cloud costs are high, given the large fleet of models training on the platform. The Coupang ML Platform team manages a hybrid setup with compute and storage clusters on-premises and on AWS. The on-premises setup provides more customization and a powerful GPU cluster at lower cost, while the cloud setup can scale on demand when on-premises resources are insufficient.

Core offerings of Coupang ML Platform

1. Notebooks & ML Pipeline Authoring

ML platform provides a hosted, containerized notebook service for developers to iterate on their ideas. The notebook can be launched using custom or standard containers on CPUs or GPUs.

A set of standard Docker containers is maintained by the platform team, containing popular ML libraries such as TensorFlow, PyTorch, scikit-learn, and Hugging Face Transformers. These containers avoid dependency complexity and help in writing repeatable pipelines.

For pipeline authoring, the platform provides a set of Python SDKs for data fetching, feature-store, training, and inference.
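
The SDK itself is internal, so as a rough illustration only, a step-based pipeline in Python might be composed along these lines (all names below are hypothetical, not the real Coupang SDK):

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List

# Hypothetical sketch of step-based pipeline authoring; the real
# Coupang pipeline SDK is internal and its API is not shown here.

@dataclass
class Pipeline:
    steps: List[Callable] = field(default_factory=list)

    def step(self, fn: Callable) -> Callable:
        """Register a function as the next pipeline stage."""
        self.steps.append(fn)
        return fn

    def run(self, data: Any) -> Any:
        """Execute stages in order, feeding each output to the next."""
        for fn in self.steps:
            data = fn(data)
        return data

pipeline = Pipeline()

@pipeline.step
def fetch_data(_: Any) -> list:
    # Stand-in for a data-fetching SDK call.
    return [{"query": "wireless mouse", "clicks": 3}]

@pipeline.step
def featurize(rows: list) -> list:
    # Stand-in for feature preparation before training.
    return [{**r, "clicked": r["clicks"] > 0} for r in rows]

result = pipeline.run(None)
```

Composing pipelines from small, registered steps is what lets each stage be containerized and scheduled independently on the cluster.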

2. Feature Engineering

Coupang ML Platform offers a feature-store built to access prepared features easily in both offline and online modes. The feature store is built on top of the popular open-source project Feast.

  • Offline feature stores are used to share prepared features and are also used for model training. We are working with teams to onboard fundamental features such as customer insights which can be consumed by multiple downstream teams.
  • Online feature store is used to fetch features with low latency during inference. This serves as a model feature generator as well as prediction response cache for compute-intensive models.
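
To illustrate the offline/online split above, here is a toy in-memory sketch (not Feast's actual API; entity and feature names are invented):

```python
# Toy in-memory sketch of the offline/online feature-store split.
# Feast's real API differs; entities and feature names are invented.

offline_store = {  # historical, point-in-time rows for training data
    ("customer_1", "2024-01-01"): {"orders_30d": 4, "avg_basket": 21.5},
    ("customer_1", "2024-02-01"): {"orders_30d": 7, "avg_basket": 25.0},
}

online_store = {  # latest values only, for low-latency serving lookups
    "customer_1": {"orders_30d": 7, "avg_basket": 25.0},
}

def get_historical_features(entity: str, date: str) -> dict:
    """Point-in-time lookup used when building training sets."""
    return offline_store[(entity, date)]

def get_online_features(entity: str) -> dict:
    """Latest feature values fetched at inference time."""
    return online_store[entity]

train_row = get_historical_features("customer_1", "2024-01-01")
serve_row = get_online_features("customer_1")
```

The key property is that training reads values as they were at a point in time, while serving reads only the freshest values, which is what keeps training and inference consistent.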

3. Model Training

ML teams at Coupang use different modeling frameworks, from the popular ones such as Pytorch, Tensorflow, Sklearn, XGBoost, to the niche ones such as Prophet for forecasting.

The training stack is framework-agnostic. User-written pipelines are containerized and launched on the Kubernetes cluster, where a batch scheduler places the jobs on the desired hardware. Users can configure their jobs to run on any CPU or GPU type available in the cluster. This is very useful, as jobs benefit from different CPU and GPU types depending on their characteristics, optimizing the return on investment. For example, users can configure model training and batch inference to run on different GPU types, trading off speedup against GPU cost.
The scheduler follows an all-or-nothing resource allocation strategy. The training stack supports distributed training strategies (distributed data parallel and fully sharded data parallel) to train large models. Multi-GPU training has sped up model training workloads significantly across Coupang.
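
The all-or-nothing (gang) allocation can be sketched as follows; this is an illustrative toy, not the scheduler's actual implementation:

```python
# Toy sketch of all-or-nothing (gang) scheduling: a distributed job
# either gets every GPU it asked for or none at all, avoiding deadlocks
# where two half-scheduled jobs each hold GPUs the other still needs.

def try_schedule(job_gpus_needed: int, free_gpus: int) -> tuple:
    """Return (scheduled, remaining_free_gpus)."""
    if job_gpus_needed <= free_gpus:
        return True, free_gpus - job_gpus_needed
    return False, free_gpus  # allocate nothing rather than a partial set

free = 8
scheduled_a, free = try_schedule(4, free)  # fits: 4 of 8 GPUs
scheduled_b, free = try_schedule(6, free)  # rejected: needs 6, only 4 left
```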

It requires significant effort to tune trainer parameters to efficiently train deep learning models. As the platform team, we benchmark trainers for popular model architectures used internally and share the most effective techniques and best practices amongst all groups who use the platform.

4. Model Inference

After training, a model is deployed to an experimentation or production environment to serve real traffic. The Seldon platform is used on Kubernetes for model inference. Seldon integrates with serving libraries such as TFServing and Triton, and also supports custom Python wrappers. Through this, it covers a wide range of model frameworks, runtimes, and hardware (CPU and GPU serving).

Each ML model can be deployed as a standalone service with autoscaling. Deploying each model as a service provides isolation and allows integration with standard CI/CD infrastructure. Model deployment jobs run multiple validation tests (model size, training-prediction skew tests, etc.) before moving into a canary phase. If the canary results are successful, the model can be gradually rolled out. Developers need minimal effort (adding hooks for model validation and canary result verification) to safely get their models serving production traffic.
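
As a toy sketch of this validation-then-canary gate (metric names and thresholds below are invented for illustration, not platform values):

```python
# Toy sketch of a deployment gate: run validation tests first, then
# compare canary metrics against the live model before gradual rollout.
# Metric names and thresholds are invented for illustration.

def passes_validation(model_size_mb: float, skew: float) -> bool:
    """Pre-deployment checks, e.g. size budget and training-prediction skew."""
    return model_size_mb <= 2048 and skew <= 0.02

def canary_ok(canary_error_rate: float, prod_error_rate: float,
              tolerance: float = 0.01) -> bool:
    """Promote only if the canary is not meaningfully worse than prod."""
    return canary_error_rate <= prod_error_rate + tolerance

deploy = passes_validation(model_size_mb=512, skew=0.005) and \
         canary_ok(canary_error_rate=0.031, prod_error_rate=0.030)
```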

To serve compute-intensive features, such as embeddings, in real time with low latency, we use the online feature store mentioned above. For very large models (LLMs, multimodal models), we are investing in batch and real-time GPU-based serving, which provides higher throughput than CPU serving.

5. Monitoring & Observability

All Coupang ML Platform services have monitoring enabled. The training cluster has resource and job monitoring dashboards (GPUs, CPUs, and memory in use), with GPU and CPU utilization metrics for workloads.
The inference service has runtime monitoring for memory usage and prediction scores. We plan to introduce data quality checks (anomaly detection, drift monitoring) across feature and model serving.
Developers use cluster usage dashboards to understand resource allocations and scheduling delays. For error debugging, application and resource usage logs are collected from the clusters and made available to developers through dashboards. Alerts are also set up for events such as stuck or idle training jobs, failures to launch instances for training or serving, and memory spikes.
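
One common drift check usable in such a setup is the population stability index (PSI) between a training-time feature distribution and the live one; a minimal sketch (the bin counts and the 0.2 alert threshold are illustrative rules of thumb, not platform specifics):

```python
import math

# Minimal population stability index (PSI) sketch for drift monitoring:
# PSI = sum over bins of (p_live - p_train) * ln(p_live / p_train).
# A PSI above ~0.2 is a common rule-of-thumb signal of significant drift.

def psi(train_counts, live_counts, eps=1e-6):
    t_total, l_total = sum(train_counts), sum(live_counts)
    score = 0.0
    for t, l in zip(train_counts, live_counts):
        p_t = max(t / t_total, eps)  # clamp to avoid log(0)
        p_l = max(l / l_total, eps)
        score += (p_l - p_t) * math.log(p_l / p_t)
    return score

stable = psi([100, 200, 300], [105, 195, 310])   # near-identical split
drifted = psi([100, 200, 300], [400, 150, 50])   # mass shifted to bin 0
```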

6. Training & Inference Clusters

In the era of large datasets and deep learning models, hardware (especially accelerators such as GPUs) plays a crucial role in ML development. Through active collaboration with the cloud infrastructure engineers at Coupang, we provide compute and storage clusters in our on-premises data center and on AWS.

Training requires instances with large memory, accelerators such as GPUs, high bandwidth connection between nodes for distributed training, and a shared storage cluster to store training data and output artifacts such as model checkpoints.

Serving requires high I/O throughput machines for performance and availability. We have a dedicated set of machines optimized for serving in multiple availability zones. Autoscaling ensures that the cluster can handle traffic spikes.
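
The autoscaling decision reduces to a simple target-replica computation (the numbers here are illustrative; production autoscalers such as the Kubernetes HPA also apply smoothing and stabilization windows):

```python
import math

# Toy sketch of horizontal autoscaling arithmetic: size the replica
# count to the current load, clamped to configured bounds. Real
# autoscalers add smoothing, but the core ratio is the same.

def target_replicas(current_rps: float, rps_per_replica: float,
                    min_replicas: int = 2, max_replicas: int = 50) -> int:
    desired = math.ceil(current_rps / rps_per_replica)
    return max(min_replicas, min(max_replicas, desired))

steady = target_replicas(current_rps=900, rps_per_replica=100)   # 9 replicas
spike = target_replicas(current_rps=8000, rps_per_replica=100)   # capped at 50
```

Keeping a floor of replicas across availability zones is what preserves availability when one zone degrades, while the cap bounds cost during extreme spikes.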

Through our partnership with ML teams at Coupang, we are able to systematically scale solutions which have been proven in one domain and can be generalized.

The following are two recent customer success stories supported by the Coupang ML Platform:

1. Training Ko-BERT to understand search queries better

ML developers working in search and recommendations launched embedding-based retrieval to augment classical term-matching retrieval. Multi-GPU distributed training on A100 GPUs provided a 10x speedup for BERT training compared to older-generation GPUs and training strategies.

After the success of BERT, the developers are experimenting with finetuned large language models (LLMs) to improve search quality across different surfaces. LLM finetuning exercises many parts of the ML platform: efficient cluster usage, distributed training strategies, high-throughput GPU-based inference, and more.
We have been fairly successful in adapting and democratizing new ML innovations through our platform.

2. Real-time price forecasting of products

Data science teams in Customer and Growth model various time series for forecasting price, demand, and page views, amongst others. The team onboarded their entire suite of pricing models from a custom inference stack to ML Platform serving. The team no longer has to maintain its own deployment cluster and can focus entirely on developing better models.

Even though we are still early in our journey, we see good traction, with customers using the services as building blocks in their ML pipelines. Over the past year, there have been 100K+ workflow runs on the platform spanning 600+ ML projects. We have seen a massive increase in the size of models being experimented with, resulting in several wins in the quality of Coupang services. All major ML groups at Coupang use one or more Coupang ML Platform services.
We see developers building domain-specific toolkits on the Coupang ML Platform, such as language modeling and AutoML. There has been strong interest in and adoption of CI/CD and best-practice features such as the online feature store and monitoring.

Upcoming posts will describe Coupang's core services and the applications they support in more detail. If you are interested in tackling the Machine Learning and infrastructure challenges that enable developers to solve hundreds of business problems and improve the customer experience, consider applying for a role on our team!
