How are GPU infrastructures managed at OpenAI

Tags
Machine Learning
GPU
Infrastructure
Published
March 12, 2023
Author
yanbc

The challenges

As of January 2021, OpenAI's Kubernetes cluster comprised 7,500 nodes. With the surging popularity of ChatGPT and the ever-increasing demand for OpenAI's APIs, the number of machines they need to manage is only expected to grow. Managing such a vast number of nodes within a single Kubernetes cluster is a daunting task. Let's delve into the challenges OpenAI faced and how they solved them. Specifically, I am going to focus on two challenges in this post:
  1. High pressure on node management: The sheer volume of nodes puts immense pressure on the control plane. From provisioning and configuration to ensuring optimal utilization and availability, everything requires meticulous planning and execution, and the scale of OpenAI's operations only intensifies this complexity.
  2. Unreliable GPU infrastructures: GPUs, crucial for accelerating complex computations and AI workloads, can be unreliable. In my own experience, unpredictable errors can make a GPU inaccessible and leave the whole node unusable. This uncertainty adds an extra layer of complexity and demands swift identification and resolution of GPU-related issues.

1. Ease pressure on node management

OpenAI employs various measures to alleviate the pressure associated with managing nodes, focusing particularly on etcd and API servers.
  1. Utilizing directly attached SSD storage for etcd: OpenAI's infrastructure team found in testing that network-attached storage severely limited etcd's performance, allowing it to use only around 10% of the available IOPS. The limitation stemmed from the roughly 2 millisecond network latency combined with the sequential nature of etcd's I/O: etcd fsyncs its write-ahead log on every commit, so per-write latency directly caps throughput. Transitioning to locally attached SSD storage reduced the latency to about 200 microseconds and significantly improved the health of etcd. (A rough way to measure this difference yourself is sketched right after this list.)
  2. Running etcd and API servers on standalone nodes: OpenAI dedicates separate standalone nodes to both etcd and the API servers, five nodes each, positioned outside of the main cluster. This separation spreads the workload and minimizes the impact of any single failure. With a cluster of 7,500 nodes, each API server ends up handling roughly 1,500 nodes.
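
To make that latency gap concrete, here is a small, hypothetical probe that mimics etcd's serial write-then-fsync pattern and reports percentile latencies. This is my own sketch, not OpenAI's tooling, and the two mount paths are placeholders for wherever your network disk and local SSD happen to be mounted:

```python
import os
import time

def measure_fsync_latency(path, iterations=1000, block=b"x" * 4096):
    """Time serial 4 KiB write+fsync pairs, mimicking etcd's WAL pattern."""
    latencies = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        for _ in range(iterations):
            start = time.perf_counter()
            os.write(fd, block)
            os.fsync(fd)  # etcd fsyncs every WAL append, so this cost is serial
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
    latencies.sort()
    return {
        "p50_us": latencies[len(latencies) // 2] * 1e6,
        "p99_us": latencies[int(len(latencies) * 0.99)] * 1e6,
    }

if __name__ == "__main__":
    # Placeholder paths: one on network-attached storage, one on a local SSD.
    for path in ("/mnt/network-disk/wal-probe", "/mnt/local-ssd/wal-probe"):
        print(path, measure_fsync_latency(path))
```

If the numbers mirror OpenAI's findings, the network-attached path should sit around 2 ms per write while the local SSD path sits near 200 µs, a 10x difference that etcd feels directly.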

2. Working with unreliable GPU infrastructures

OpenAI employs a comprehensive approach to monitor GPU infrastructures, utilizing both passive and active health checks. Let's explore these strategies in detail:
  1. Passive health checks: OpenAI follows standard practice by exposing metrics to Prometheus and using Grafana for visualization and alerting. They go beyond the basics by leveraging the DCGM (Data Center GPU Manager) and NVML (NVIDIA Management Library) libraries to gather metrics specific to NVIDIA GPUs. Notably, as the number of nodes in the cluster grew, the sheer volume of collected metrics became a problem of its own, so they used Prometheus rules to drop all but the most useful ones, a decision I fully agree with. As the saying goes, too much information is no information at all! (A minimal exporter in this spirit is sketched after this list.)
  2. Active GPU tests: In addition to passive Prometheus metrics, OpenAI runs custom GPU tests both when an instance is created, ensuring that only healthy nodes join the cluster, and periodically via Kubernetes CronJobs. The CronJob is configured to run on random nodes within the cluster, striking a balance between cost-efficiency and test coverage. (A toy version of such a test follows the exporter sketch below.)
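
To make the passive side concrete, here is a minimal sketch of a GPU metrics exporter. This is not OpenAI's actual code: it assumes the pynvml and prometheus_client Python packages, and the metric names and port 9400 are arbitrary choices of mine (in production, NVIDIA's dcgm-exporter fills this role):

```python
import time

from prometheus_client import Gauge, start_http_server
import pynvml

# Hypothetical metric names; real deployments would follow DCGM conventions.
GPU_TEMP = Gauge("gpu_temperature_celsius", "GPU core temperature", ["gpu"])
GPU_UTIL = Gauge("gpu_utilization_percent", "GPU compute utilization", ["gpu"])

def collect():
    """Refresh gauges for every GPU visible through NVML."""
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
        GPU_TEMP.labels(gpu=str(i)).set(temp)
        GPU_UTIL.labels(gpu=str(i)).set(util)

if __name__ == "__main__":
    pynvml.nvmlInit()
    start_http_server(9400)  # Prometheus scrapes http://<node>:9400/metrics
    try:
        while True:
            collect()
            time.sleep(15)
    finally:
        pynvml.nvmlShutdown()
```

Exposing only a handful of gauges like this is the code-level analogue of OpenAI's decision above: collect the few metrics you will actually alert on, not everything NVML can report.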
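
And for the active side, a toy health check that probes every GPU through NVML and exits non-zero on failure, so a wrapping Kubernetes CronJob (or an instance-creation hook) surfaces unhealthy nodes through the job status. Again, this is a sketch under the same pynvml assumption, not OpenAI's actual test suite:

```python
import sys

import pynvml

def check_gpus():
    """Return (index, error) pairs for GPUs that fail a basic NVML probe."""
    pynvml.nvmlInit()
    try:
        failures = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            try:
                # A GPU that has "fallen off the bus" typically errors here.
                name = pynvml.nvmlDeviceGetName(handle)
                if isinstance(name, bytes):  # older pynvml returns bytes
                    name = name.decode()
                mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
                print(f"GPU {i} ({name}): {mem.free / 2**30:.1f} GiB free")
            except pynvml.NVMLError as err:
                failures.append((i, str(err)))
        return failures
    finally:
        pynvml.nvmlShutdown()

if __name__ == "__main__":
    failed = check_gpus()
    for idx, err in failed:
        print(f"GPU {idx} unhealthy: {err}", file=sys.stderr)
    sys.exit(1 if failed else 0)  # non-zero exit marks the CronJob run as failed
```

A real test would go further, for example allocating memory or running a short CUDA kernel, but even this cheap probe catches the "GPU disappeared" failures that make a node silently useless.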

Further reading

To delve deeper into how OpenAI's infrastructure team manages their Kubernetes cluster, I recommend reading the following blog posts:
  1. Scaling Kubernetes to 7,500 nodes (https://openai.com/blog/scaling-kubernetes-to-7500-nodes/)
  2. Scaling Kubernetes to 2,500 nodes (https://openai.com/blog/scaling-kubernetes-to-2500-nodes/)