Meet Coupang’s Machine Learning Platform (2024)

How Coupang’s ML Platform accelerates ML development for Coupang products

Coupang Engineering


By Hyun Jung Baek, Hara Ketha, Jaideep Ray, Justina Min, Mohamed Sabbah, Ronak Panchal, Seshu Adunuthula, Thimma Reddy Kalva, and Enhua Tan

This post is also available in Korean.

Coupang is reimagining the shopping and delivery experience to wow customers from the instant they open the Coupang app to the moment an order arrives at their door. Beyond e-commerce, Coupang offers a range of consumer services, including Coupang Eats for food delivery, Coupang Play for video streaming, Coupang Pay for payments, and Coupang Grocery for fresh products, amongst others.

Machine Learning (ML) impacts every aspect of the Coupang customer experience: the product catalog, search, pricing, robotics, inventory, and fulfillment. As Coupang ventures into new markets, ML plays an even more important role.

ML helps power search and discovery across Coupang websites and apps, price products and services, streamline logistics and delivery, optimize content for streaming, rank ads, and much more.

Therefore, we strive to scale machine learning development at all ML lifecycle stages, including ad-hoc exploration, training data preparation, model development, and robust production deployment of models.

Table of Contents

· ML @ Coupang
· Motivation
  1. Reduce time to production
  2. Incorporate CI/CD in ML development
  3. Scale ML compute efficiently
· Core offerings of Coupang ML Platform
  1. Notebooks & ML Pipeline Authoring
  2. Feature Engineering
  3. Model Training
  4. Model Inference
  5. Monitoring & Observability
  6. Training & Inference Clusters
· Success Stories
  1. Training Ko-BERT to understand search queries better
  2. Real-time price forecasting of products

ML teams at Coupang are actively developing models in Natural Language Processing (NLP), Computer Vision (CV), Recommendations, and Forecasting. NLP is used to understand search queries, product listings, and ads content. Computer vision-enabled image understanding categorizes similar products and ads. Recommendation models rank content for product search, videos in Coupang Play, and product ads. Forecasting techniques help us understand supply, demand, and pricing for millions of products.

This post introduces Coupang’s internal ML platform and describes how the platform supports the increasing scale and diversity of workloads across ML frameworks, programming languages, different model architectures, and training & serving paradigms.

The motivation behind Coupang ML Platform is to provide ‘batteries-included’ services to accelerate ML development through improved developer productivity.

Core services include managed notebooks (Jupyter), a pipeline SDK, a feature store, model training, and model inference. ML teams can use these services independently to compose their ML pipelines. Our focus areas are as follows:

1. Reduce time to production

Before the Coupang ML Platform, authoring and training an ML model required hours of non-trivial setup work and boilerplate code to prepare data and features and to write trainer code. Tasks like scaling model training across GPUs with distributed training took deep engineering work and led to duplicated stacks across teams.
Deploying an ML model to serve real-time traffic took weeks of effort, with each team replicating logic for model benchmarking, auto-scaling, security, and rollback. These were blockers for product groups adopting ML at a larger scale. By leveraging the ML Platform's lifecycle services, teams can train, debug, and deploy simple to complex models to production within days, in a standardized way.

2. Incorporate CI/CD in ML development

ML development can quickly incur heavy technical debt. To make it easier for ML teams to build, deploy, and maintain models, we provide integration-tested, prepackaged containers with popular ML libraries.
Moreover, we provide libraries to validate models, add canary stages to model deployment, and monitor primary metrics during serving.

3. Scale ML compute efficiently

There is surging demand for compute in Coupang — GPUs for deep learning training, storage for large datasets, and network bandwidth for distributed training. Cloud costs are high, given the large fleet of models training on the platform. The Coupang ML Platform team manages a hybrid setup with compute and storage clusters on-premises and on AWS. The on-premises setup provides more customization and a powerful GPU cluster at lower cost, while the cloud setup can scale on demand when on-premises resources are insufficient.

Core offerings of Coupang ML Platform

1. Notebooks & ML Pipeline Authoring

ML platform provides a hosted, containerized notebook service for developers to iterate on their ideas. The notebook can be launched using custom or standard containers on CPUs or GPUs.

A set of standard Docker containers is maintained by the platform team, containing popular ML libraries such as TensorFlow, PyTorch, scikit-learn, and Hugging Face Transformers. These containers avoid dependency complexity and help in writing repeatable pipelines.

For pipeline authoring, the platform provides a set of Python SDKs for data fetching, feature-store, training, and inference.
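
The SDK itself is internal, so as a rough illustration only, a step-based pipeline in Python might be composed along these lines (all names below are hypothetical, not the real Coupang SDK):

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List

# Hypothetical sketch of step-based pipeline authoring; the real
# Coupang pipeline SDK is internal and its API is not shown here.

@dataclass
class Pipeline:
    steps: List[Callable] = field(default_factory=list)

    def step(self, fn: Callable) -> Callable:
        """Register a function as the next pipeline stage."""
        self.steps.append(fn)
        return fn

    def run(self, data: Any) -> Any:
        """Execute stages in order, feeding each output to the next."""
        for fn in self.steps:
            data = fn(data)
        return data

pipeline = Pipeline()

@pipeline.step
def fetch_data(_: Any) -> list:
    # Stand-in for a data-fetching SDK call.
    return [{"query": "wireless mouse", "clicks": 3}]

@pipeline.step
def featurize(rows: list) -> list:
    # Stand-in for feature preparation before training.
    return [{**r, "clicked": r["clicks"] > 0} for r in rows]

result = pipeline.run(None)
```

Composing pipelines from small, registered steps is what lets each stage be containerized and scheduled independently on the cluster.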

2. Feature Engineering

Coupang ML Platform offers a feature-store built to access prepared features easily in both offline and online modes. The feature store is built on top of the popular open-source project Feast.

  • Offline feature stores are used to share prepared features and are also used for model training. We are working with teams to onboard fundamental features such as customer insights which can be consumed by multiple downstream teams.
  • Online feature store is used to fetch features with low latency during inference. This serves as a model feature generator as well as prediction response cache for compute-intensive models.
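
To illustrate the offline/online split above, here is a toy in-memory sketch (not Feast's actual API; entity and feature names are invented):

```python
# Toy in-memory sketch of the offline/online feature-store split.
# Feast's real API differs; entities and feature names are invented.

offline_store = {  # historical, point-in-time rows for training data
    ("customer_1", "2024-01-01"): {"orders_30d": 4, "avg_basket": 21.5},
    ("customer_1", "2024-02-01"): {"orders_30d": 7, "avg_basket": 25.0},
}

online_store = {  # latest values only, for low-latency serving lookups
    "customer_1": {"orders_30d": 7, "avg_basket": 25.0},
}

def get_historical_features(entity: str, date: str) -> dict:
    """Point-in-time lookup used when building training sets."""
    return offline_store[(entity, date)]

def get_online_features(entity: str) -> dict:
    """Latest feature values fetched at inference time."""
    return online_store[entity]

train_row = get_historical_features("customer_1", "2024-01-01")
serve_row = get_online_features("customer_1")
```

The key property is that training reads values as they were at a point in time, while serving reads only the freshest values, which is what keeps training and inference consistent.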

3. Model Training

ML teams at Coupang use different modeling frameworks, from the popular ones such as Pytorch, Tensorflow, Sklearn, XGBoost, to the niche ones such as Prophet for forecasting.

The training stack is framework-agnostic. User-written pipelines are containerized and launched on the Kubernetes cluster, where a batch scheduler places the jobs on the desired hardware. Users can configure their jobs to run on any CPU or GPU type available in the cluster. This is very useful, as jobs benefit from different CPU and GPU types depending on their characteristics, optimizing the return on investment. For example, users can configure model training and batch inference to run on different GPU types, trading off speedup against GPU cost.
The scheduler follows an all-or-nothing resource allocation strategy. The training stack supports distributed training strategies (distributed data parallel and fully sharded data parallel) to train large models. Multi-GPU training has sped up model training workloads significantly across Coupang.
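
The all-or-nothing (gang) allocation can be sketched as follows; this is an illustrative toy, not the scheduler's actual implementation:

```python
# Toy sketch of all-or-nothing (gang) scheduling: a distributed job
# either gets every GPU it asked for or none at all, avoiding deadlocks
# where two half-scheduled jobs each hold GPUs the other still needs.

def try_schedule(job_gpus_needed: int, free_gpus: int) -> tuple:
    """Return (scheduled, remaining_free_gpus)."""
    if job_gpus_needed <= free_gpus:
        return True, free_gpus - job_gpus_needed
    return False, free_gpus  # allocate nothing rather than a partial set

free = 8
scheduled_a, free = try_schedule(4, free)  # fits: 4 of 8 GPUs
scheduled_b, free = try_schedule(6, free)  # rejected: needs 6, only 4 left
```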

It requires significant effort to tune trainer parameters to efficiently train deep learning models. As the platform team, we benchmark trainers for popular model architectures used internally and share the most effective techniques and best practices amongst all groups who use the platform.

4. Model Inference

After training, a model is deployed to an experimentation or production environment to serve real traffic. The Seldon platform is used on Kubernetes for model inference. Seldon integrates with serving libraries such as TFServing and Triton, and also supports custom Python wrappers. Through this, it covers a wide range of model frameworks, runtimes, and hardware (CPU and GPU serving).

Each ML model can be deployed as a standalone service with autoscaling. Deploying each model as a service provides isolation and allows integration with standard CI/CD infrastructure. Model deployment jobs run multiple validation tests (model size, training-prediction skew tests, etc.) before moving into a canary phase. If the canary results are successful, the model can be gradually rolled out. Developers need minimal effort (adding hooks for model validation and canary result verification) to safely get their models serving production traffic.
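
As a toy sketch of this validation-then-canary gate (metric names and thresholds below are invented for illustration, not platform values):

```python
# Toy sketch of a deployment gate: run validation tests first, then
# compare canary metrics against the live model before gradual rollout.
# Metric names and thresholds are invented for illustration.

def passes_validation(model_size_mb: float, skew: float) -> bool:
    """Pre-deployment checks, e.g. size budget and training-prediction skew."""
    return model_size_mb <= 2048 and skew <= 0.02

def canary_ok(canary_error_rate: float, prod_error_rate: float,
              tolerance: float = 0.01) -> bool:
    """Promote only if the canary is not meaningfully worse than prod."""
    return canary_error_rate <= prod_error_rate + tolerance

deploy = passes_validation(model_size_mb=512, skew=0.005) and \
         canary_ok(canary_error_rate=0.031, prod_error_rate=0.030)
```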

To serve compute-intensive features, such as embeddings, in real time with low latency, we use the online feature store mentioned above. For very large models (LLMs, multimodal models), we are investing in batch and real-time GPU-based serving, which provides higher throughput than CPU serving.

5. Monitoring & Observability

All Coupang ML Platform services have monitoring enabled. The training cluster has resource and job monitoring dashboards (GPUs, CPUs, and memory in use), with GPU and CPU utilization metrics for workloads.
The inference service has runtime monitoring for memory usage and prediction scores. We plan to introduce data quality checks (anomaly detection, drift monitoring) across feature and model serving.
Developers use cluster usage dashboards to understand resource allocations and scheduling delays. For error debugging, application and resource usage logs are collected from the clusters and made available to developers through dashboards. Alerts are also set up for events such as stuck or idle training jobs, failures to launch instances for training or serving, and memory spikes.
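
One common drift check usable in such a setup is the population stability index (PSI) between a training-time feature distribution and the live one; a minimal sketch (the bin counts and the 0.2 alert threshold are illustrative rules of thumb, not platform specifics):

```python
import math

# Minimal population stability index (PSI) sketch for drift monitoring:
# PSI = sum over bins of (p_live - p_train) * ln(p_live / p_train).
# A PSI above ~0.2 is a common rule-of-thumb signal of significant drift.

def psi(train_counts, live_counts, eps=1e-6):
    t_total, l_total = sum(train_counts), sum(live_counts)
    score = 0.0
    for t, l in zip(train_counts, live_counts):
        p_t = max(t / t_total, eps)  # clamp to avoid log(0)
        p_l = max(l / l_total, eps)
        score += (p_l - p_t) * math.log(p_l / p_t)
    return score

stable = psi([100, 200, 300], [105, 195, 310])   # near-identical split
drifted = psi([100, 200, 300], [400, 150, 50])   # mass shifted to bin 0
```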

6. Training & Inference Clusters

In the era of large datasets and deep learning models, hardware (especially accelerators such as GPUs) plays a crucial role in ML development. Through active collaboration with the cloud infrastructure engineers at Coupang, we provide compute and storage clusters in our on-premises data center and on AWS.

Training requires instances with large memory, accelerators such as GPUs, high bandwidth connection between nodes for distributed training, and a shared storage cluster to store training data and output artifacts such as model checkpoints.

Serving requires high I/O throughput machines for performance and availability. We have a dedicated set of machines optimized for serving in multiple availability zones. Autoscaling ensures that the cluster can handle traffic spikes.
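
The autoscaling decision reduces to a simple target-replica computation (the numbers here are illustrative; production autoscalers such as the Kubernetes HPA also apply smoothing and stabilization windows):

```python
import math

# Toy sketch of horizontal autoscaling arithmetic: size the replica
# count to the current load, clamped to configured bounds. Real
# autoscalers add smoothing, but the core ratio is the same.

def target_replicas(current_rps: float, rps_per_replica: float,
                    min_replicas: int = 2, max_replicas: int = 50) -> int:
    desired = math.ceil(current_rps / rps_per_replica)
    return max(min_replicas, min(max_replicas, desired))

steady = target_replicas(current_rps=900, rps_per_replica=100)   # 9 replicas
spike = target_replicas(current_rps=8000, rps_per_replica=100)   # capped at 50
```

Keeping a floor of replicas across availability zones is what preserves availability when one zone degrades, while the cap bounds cost during extreme spikes.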

Through our partnership with ML teams at Coupang, we are able to systematically scale solutions which have been proven in one domain and can be generalized.

The following are two recent customer success stories supported by the Coupang ML Platform:

1. Training Ko-BERT to understand search queries better

ML developers working in search and recommendations launched embedding-based retrieval to augment classical term-matching retrieval. Multi-GPU distributed training on A100 GPUs provided a 10x speedup for BERT training compared to older-generation GPUs and training strategies.

After the success of BERT, the developers are experimenting with finetuned large language models (LLMs) to improve search quality across different surfaces. LLM finetuning exercises many parts of the ML platform: efficient cluster usage, distributed training strategies, high-throughput GPU-based inference, and more.
We have been fairly successful in adapting and democratizing new ML innovations through our platform.

2. Real-time price forecasting of products

Data science teams in Customer and Growth model various time series for forecasting price, demand, and page views, amongst others. The team onboarded their entire suite of pricing models from a custom inference stack to ML Platform serving. The team no longer has to maintain its own deployment cluster and can focus entirely on developing better models.

Even though we are still early in our journey, we see good traction, with customers using the services as building blocks in their ML pipelines. Over the past year, there have been 100K+ workflow runs on the platform spanning 600+ ML projects. We have seen a massive increase in the size of models being experimented with, resulting in several wins in the quality of Coupang services. All major ML groups at Coupang use one or more Coupang ML Platform services.
We see developers building domain-specific toolkits on the Coupang ML Platform, such as language modeling and AutoML. There has been strong interest in and adoption of CI/CD and best-practice features such as the online feature store and monitoring.

Upcoming posts will describe Coupang's core services and the applications they support in more detail. If you are interested in tackling the Machine Learning and infrastructure challenges that enable developers to solve hundreds of business problems and improve the customer experience, consider applying for a role on our team!
