Introducing GPUStack 0.2: heterogeneous distributed inference, CPU inference and scheduling strategies


GPUStack is an open-source GPU cluster manager for running large language models (LLMs). It enables you to create a unified cluster from GPUs across various platforms, including Apple MacBooks, Windows PCs, and Linux servers. Administrators can deploy LLMs from popular repositories like Hugging Face, and developers can access these models as easily as they would access public LLM services from providers such as OpenAI or Microsoft Azure.
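
For example, once a model is deployed, it can be consumed with any OpenAI-compatible client. The sketch below uses the official openai Python package; the server address, endpoint path, API key, and model name are placeholders to adjust for your own deployment:

```python
# Minimal sketch of calling a GPUStack-served model through an
# OpenAI-compatible endpoint. The base_url path, API key, and model name
# below are placeholders -- adjust them to match your deployment.
from openai import OpenAI

client = OpenAI(
    base_url="http://your-gpustack-server/v1-openai",  # placeholder endpoint
    api_key="your-api-key",                            # key created in the GPUStack UI
)

response = client.chat.completions.create(
    model="llama3.1",  # name of a model you have deployed (placeholder)
    messages=[{"role": "user", "content": "Tell me a joke."}],
)
print(response.choices[0].message.content)
```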

 

Since its launch at the end of July, GPUStack has been well received by the community, generating substantial feedback. Based on these insights and our roadmap, we have now released GPUStack 0.2. This version introduces significant features, including multi-GPU inference, distributed inference across workers, CPU inference, binpack and spread placement strategies, as well as the ability to assign models to specific workers and manually select GPUs. Additionally, support for Nvidia GPUs has been expanded, and several community-reported issues have been addressed to better accommodate diverse use cases.

 

For more information about GPUStack, visit:

GitHub repo: https://github.com/gpustack/gpustack

User guide: https://docs.gpustack.ai

 

Key features

Distributed inference

One of the standout features of GPUStack 0.2 is its out-of-the-box support for multi-GPU inference and distributed inference across workers. Administrators can run LLMs across multiple GPUs or workers without complex configurations, making it possible to handle models that exceed the capacity of a single GPU.

 

Distributed Inference Across Multiple GPUs

In 0.1, when no single GPU in GPUStack could meet a model’s resource requirements, GPUStack employed a partial offloading method, utilizing both the CPU and GPU for inference. However, this approach hurt performance because of its reliance on the CPU, which could not meet high-performance inference requirements.

 

To improve this, GPUStack 0.2 introduces multi-GPU inference, allowing models to be distributed across multiple GPUs, with each GPU processing different layers of the model. This enables administrators to run LLMs with larger parameter sizes while ensuring enhanced performance and efficiency.
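
Conceptually, each GPU is assigned a contiguous block of the model's layers, roughly in proportion to its available VRAM. The following sketch only illustrates that idea; it is not GPUStack's actual placement code, and the GPU sizes are made-up examples:

```python
# Illustrative sketch: split a model's layers across GPUs in proportion to
# their free VRAM. This is a simplified model of the idea, not GPUStack's
# actual placement algorithm.
def split_layers(total_layers: int, free_vram_per_gpu: list[int]) -> list[int]:
    total_vram = sum(free_vram_per_gpu)
    # Provisionally assign layers proportionally, then hand any remainder
    # to the GPUs with the most free VRAM.
    layers = [total_layers * v // total_vram for v in free_vram_per_gpu]
    remainder = total_layers - sum(layers)
    for i in sorted(range(len(layers)), key=lambda i: -free_vram_per_gpu[i])[:remainder]:
        layers[i] += 1
    return layers

# Example: an 80-layer model over GPUs with 24 GiB, 16 GiB, and 8 GiB free.
print(split_layers(80, [24, 16, 8]))  # -> [41, 26, 13]
```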

 

Distributed Inference Across Workers

To support large-scale models such as Llama 3.1 405B, Llama 3.1 70B, and Qwen2 72B, GPUStack 0.2 adds distributed inference across workers. When a single worker cannot meet a model’s resource requirements, GPUStack distributes the model across multiple workers, enabling inference across different hosts.
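
As a rough mental model, the scheduler must find a set of workers whose combined free VRAM covers the model's estimated footprint. The greedy sketch below is purely illustrative and does not reflect GPUStack's actual scheduling logic:

```python
# Simplified sketch: greedily pick workers until their combined free VRAM
# covers the model's estimated memory requirement. Worker names and sizes
# are invented examples.
def pick_workers(required_gib: float, free_vram_by_worker: dict[str, float]) -> list[str]:
    chosen, covered = [], 0.0
    for worker, free in sorted(free_vram_by_worker.items(), key=lambda kv: -kv[1]):
        if covered >= required_gib:
            break
        chosen.append(worker)
        covered += free
    if covered < required_gib:
        raise RuntimeError("not enough aggregate VRAM across workers")
    return chosen

# Example: a ~140 GiB requirement spread over three hosts.
print(pick_workers(140, {"worker-a": 80, "worker-b": 48, "worker-c": 24}))
```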

 

However, inference performance may be constrained by cross-host network bandwidth, potentially causing significant slowdowns. To achieve optimal performance, it’s recommended to use high-performance networking solutions such as NVLink/NVSwitch or RDMA. For consumer-grade setups, Thunderbolt interconnects are also a viable option.

 

Additionally, since larger models from Hugging Face often require splitting due to their size, GPUStack 0.2 now supports downloading and running sharded models, facilitating efficient handling of these larger models.
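
Sharded repositories typically publish their weights as numbered parts (for example, files named with a "-00001-of-00003" suffix). The small sketch below only illustrates how such shards can be recognized and ordered; the file names are invented examples:

```python
# Illustrative sketch: group and order the parts of a sharded model based on
# the common "-00001-of-00003" naming convention.
import re

SHARD_RE = re.compile(r"-(\d+)-of-(\d+)\.")

def order_shards(filenames: list[str]) -> list[str]:
    shards = [f for f in filenames if SHARD_RE.search(f)]
    return sorted(shards, key=lambda f: int(SHARD_RE.search(f).group(1)))

files = [
    "model-00002-of-00003.safetensors",
    "model-00001-of-00003.safetensors",
    "model-00003-of-00003.safetensors",
]
print(order_shards(files))  # shards in load order
```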


 

CPU inference

GPUStack 0.2 introduces support for CPU inference. When no GPUs are available or GPU resources are insufficient, GPUStack can fall back to CPU-based inference, loading the entire model into memory and running it on the CPU. This allows administrators to deploy LLMs with smaller parameter sizes in environments without GPUs, expanding GPUStack’s use cases in edge and resource-constrained scenarios.

 

Scheduling strategies

Binpack Placement Strategy to Reduce Resource Fragmentation

The binpack placement strategy is a compact scheduling approach and was the default strategy for deploying models in GPUStack 0.1. When a GPU meets a model’s resource requirements, this strategy consolidates multiple replicas of the model onto the same GPU to maximize utilization. It continues to schedule additional model instances onto that GPU until its remaining resources are insufficient, at which point it selects another GPU.

 

The binpack strategy reduces resource fragmentation and optimizes GPU usage. Fragmentation refers to small amounts of unused resources scattered across GPUs that are insufficient to accommodate new models, leading to wasted computational resources. By concentrating models on fewer GPUs, the binpack strategy allows other GPUs to retain their full computational capacity for handling larger models.

 

Spread Placement Strategy to Improve Load Balancing

While the binpack strategy reduces resource fragmentation and maximizes GPU utilization, it can sometimes lead to overloading a few GPUs while leaving others underutilized. To address this, GPUStack 0.2 introduces the spread placement strategy.

 

Unlike binpack’s compact scheduling, the spread strategy aims to distribute models evenly across multiple GPUs. This prevents excessive resource concentration on a single GPU and ensures a more balanced load across all GPUs. As a result, it reduces performance bottlenecks caused by resource contention and improves overall model performance and stability.

 

Under the spread strategy, models are first assigned to GPUs with lower loads, ensuring that all GPUs contribute to inference tasks. This strategy is especially beneficial in high-concurrency or high-performance scenarios, improving cluster elasticity and avoiding overload on individual GPUs when sufficient resources are available. While GPUStack 0.2 defaults to the spread strategy, administrators can choose the strategy that best fits their needs.
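
To make the contrast concrete, here is a simplified sketch of how the two strategies could pick a GPU for a new model instance. It is an illustration of the placement idea only, not GPUStack's scheduler, and the GPU sizes are hypothetical:

```python
# Simplified sketch contrasting binpack and spread GPU selection.
# "gpus" maps a GPU id to its free VRAM in GiB; "required" is the model's
# estimated VRAM need.
def pick_gpu(gpus: dict[str, float], required: float, strategy: str = "spread") -> str | None:
    candidates = {g: free for g, free in gpus.items() if free >= required}
    if not candidates:
        return None  # no single GPU fits; other placement paths take over
    if strategy == "binpack":
        # Most-utilized fit: the GPU with the least free VRAM that still fits,
        # keeping other GPUs free for larger models.
        return min(candidates, key=candidates.get)
    # Spread: the least-utilized GPU, balancing load across the cluster.
    return max(candidates, key=candidates.get)

gpus = {"gpu-0": 6.0, "gpu-1": 20.0, "gpu-2": 12.0}
print(pick_gpu(gpus, required=5.0, strategy="binpack"))  # -> gpu-0
print(pick_gpu(gpus, required=5.0, strategy="spread"))   # -> gpu-1
```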


 

Assign specific workers

In GPUStack 0.2, administrators can label workers and use the worker selector to assign models to specific workers based on these labels during deployment. This allows for precise control over model deployment, optimizing resource allocation to align with specific needs or strategies.

 

This feature is particularly valuable for fine-grained resource management, such as deploying models to GPUs of a specific brand or type in a heterogeneous environment. By leveraging label selection, GPUStack enhances resource management efficiency in complex environments, providing increased flexibility and precision in model deployment.
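
Label matching itself is straightforward: a worker qualifies when its labels contain every key/value pair in the model's worker selector. The sketch below illustrates the idea with hypothetical label keys:

```python
# Illustrative sketch of worker-selector matching: a worker is eligible when
# its labels include every key/value pair requested by the model.
# The labels below are hypothetical examples, not required GPUStack keys.
def matches(worker_labels: dict[str, str], selector: dict[str, str]) -> bool:
    return all(worker_labels.get(k) == v for k, v in selector.items())

workers = {
    "macbook-m3": {"os": "macos", "gpu-vendor": "apple"},
    "dc-node-01": {"os": "linux", "gpu-vendor": "nvidia", "tier": "datacenter"},
}
selector = {"gpu-vendor": "nvidia"}
print([name for name, labels in workers.items() if matches(labels, selector)])
```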


 

Manually assign specific GPUs

One of GPUStack’s key features is its ability to automatically calculate model resource requirements and schedule models. This removes the need for administrators to manually allocate resources or manage scheduling. GPUStack 0.2 supports various scheduling strategies, such as binpack and spread placement, multi-GPU inference, distributed inference across workers, and assigning models to specific workers, granting administrators comprehensive control over model scheduling.

 

GPUStack’s scheduling capabilities continue to evolve to meet diverse use cases. In 0.2, manual scheduling is now available, allowing administrators to assign models to specific GPUs, providing even finer control over model resource management.


 

Option to enable CPU offloading for hybrid or CPU inference

In 0.1, when a model couldn't be fully offloaded onto a GPU due to insufficient VRAM, GPUStack would automatically offload part of the model to the GPU while loading the remaining layers into memory for CPU-based inference. This method is called partial offloading or hybrid inference.

 

This allowed LLMs to run even with limited VRAM, though overall performance suffered due to CPU dependency. In performance-sensitive scenarios, it was challenging for administrators to gauge whether a model was fully loaded onto the GPU, making it difficult to determine if additional GPU resources were needed for optimal performance.

 

In 0.2, administrators can decide whether to enable CPU offloading, which is disabled by default to prioritize GPU inference. If no GPU meets the model's resource requirements, the model will remain in a pending state until an available GPU is found. Administrators can also opt to enable CPU offloading for hybrid inference or CPU inference, though this may come with performance trade-offs.
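
The resulting placement decision can be summarized as: prefer full GPU placement; if that is impossible and CPU offloading is disabled, keep the model pending; otherwise fall back to hybrid or pure CPU inference. The sketch below is a compact illustration of that decision flow, not GPUStack's code:

```python
# Simplified sketch of the placement decision when CPU offloading may be
# enabled or disabled. Purely illustrative of the decision flow described above.
def decide_placement(fits_on_gpu: bool, has_gpu: bool, allow_cpu_offload: bool) -> str:
    if has_gpu and fits_on_gpu:
        return "gpu"                       # full GPU inference (preferred)
    if not allow_cpu_offload:
        return "pending"                   # wait until a suitable GPU is free
    return "hybrid" if has_gpu else "cpu"  # partial offload, or pure CPU inference

print(decide_placement(fits_on_gpu=False, has_gpu=True, allow_cpu_offload=False))  # -> pending
print(decide_placement(fits_on_gpu=False, has_gpu=True, allow_cpu_offload=True))   # -> hybrid
```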


 

Other new features

Support for Nvidia GPUs with Compute Capability 6.0, 6.1, 7.0, and 7.5

In 0.2, GPUStack expanded its Nvidia GPU support, adding compatibility for GPUs with compute capability of 6.0, 6.1, 7.0, and 7.5. This includes models such as the NVIDIA T4, V100, Tesla P100, P40, P4, and the GeForce GTX 10 and RTX 20 series. This expansion allows GPUStack to better address both data center and consumer-level use cases.

 

GPUStack now supports all Nvidia GPUs with compute capability from 6.0 to 8.9. For more information, refer to Nvidia’s GPU compute capability guide: https://developer.nvidia.com/cuda-gpus.
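
If you are unsure of a GPU’s compute capability, recent Nvidia drivers can report it directly. The sketch below assumes the compute_cap query field of nvidia-smi, which may not be available on older drivers:

```python
# Query each local Nvidia GPU's compute capability and check it against the
# 6.0-8.9 range supported by GPUStack 0.2. The "compute_cap" query field is
# available on recent drivers; older drivers may not support it.
import subprocess

out = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,compute_cap", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout

for line in out.strip().splitlines():
    name, cap = [field.strip() for field in line.split(",")]
    supported = 6.0 <= float(cap) <= 8.9
    print(f"{name}: compute capability {cap} -> {'supported' if supported else 'not supported'}")
```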

 

For other enhancements and bug fixes, see the full changelog:

https://github.com/gpustack/gpustack/releases/tag/0.2.0

 

 

Join Our Community

Please find more information about GPUStack at: https://gpustack.ai.

 

If you encounter any issues or have suggestions, feel free to join our Community to get support from the GPUStack team and connect with users from around the world.

 

We are continuously improving the GPUStack project. Before getting started, we encourage you to follow and star our project on GitHub at gpustack/gpustack to receive updates on future releases. We also welcome contributions to the project.

 
