Running the Full Qwen 2.5 Series on GPUStack: A Performance and Resource Allocation Review

Qwen 2.5

On September 19, 2024, at the Apsara Conference, Alibaba Cloud released Qwen 2.5, its new generation of open-source large language models. Notably, the flagship Qwen 2.5-72B outperforms Llama 3.1-405B, once again claiming the top spot among open-source large language models. Key features of Qwen 2.5:

  • Dense, easy-to-use, decoder-only language models, available in 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B sizes, and base and instruct variants.
  • Pretrained on a new large-scale dataset encompassing up to 18T tokens.
  • Significant improvements in instruction following, generating long texts (over 8K tokens), understanding structured data (e.g., tables), and generating structured outputs, especially JSON.
  • More resilient to the diversity of system prompts, enhancing role-play implementation and condition-setting for chatbots.
  • Support for context lengths of up to 128K tokens and generation of up to 8K tokens.
  • Multilingual support for over 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more.

For more details, please refer to the release blog post of Qwen 2.5: https://qwenlm.github.io/blog/qwen2.5/

In this article, we cover running the entire Qwen 2.5 series on GPUStack, including the general-purpose Qwen 2.5 chat models, the code-focused Qwen 2.5-Coder, and the math-focused Qwen 2.5-Math, and review their performance and resource consumption.

 

Running the Full Series of Qwen 2.5

Installing GPUStack

Here we use a Mac Studio and an Ubuntu PC with dual RTX 4080 GPUs to form a two-node heterogeneous GPU cluster. The Mac Studio runs as both Server and Worker, while the Ubuntu PC runs as a Worker only. The Server role provides the control plane, and the Worker role provides the computational resources for running LLMs.

First, install GPUStack on the Mac Studio. GPUStack provides an installation script that allows GPUStack to run as a launchd service on macOS. For more installation scenarios, check the official GPUStack documentation: https://docs.gpustack.ai/.

Run the following command to install GPUStack on the Mac Studio:
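The command below follows the GPUStack quickstart linked above; if the script or its options have changed since the time of writing, defer to the current documentation:

```bash
# Download and run the GPUStack installation script;
# on macOS this sets GPUStack up as a launchd service.
curl -sfL https://get.gpustack.ai | sh -s -
```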

When the script completes successfully, GPUStack has been deployed and started.

Next, get the initial admin password for logging in to GPUStack by running the following command:
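The file path below is the default documented in the GPUStack quickstart; it may vary by version:

```bash
# Print the auto-generated initial admin password
cat /var/lib/gpustack/initial_admin_password
```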

Open http://myserver in a browser (replace myserver with the server's actual IP address or domain name) and log in to GPUStack with the username admin and the initial password obtained in the previous step.

Now, set a new password and log in to GPUStack.

Next, we will add the Ubuntu PC as a worker node to the GPUStack cluster.

In the GPUStack menu, click Resources, then click Add Worker and follow the instructions:

[Screenshot: Add Worker instructions]

Copy the command to get token and run it on the GPUStack server:
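Per the GPUStack docs at the time of writing, the registration token is stored in the file below (verify the path against the current documentation):

```bash
# On the server: print the token used to register new workers
cat /var/lib/gpustack/token
```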

Next, run the following command on the Ubuntu PC to register it as a worker, replacing mytoken with the token obtained in the previous step:
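A sketch of the registration command, again following the quickstart (http://myserver stands in for your actual server URL):

```bash
# On the Ubuntu PC: install GPUStack and register it as a worker.
# Replace http://myserver and mytoken with your server URL and token.
curl -sfL https://get.gpustack.ai | sh -s - --server-url http://myserver --token mytoken
```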

The worker then appears in the cluster on GPUStack:

[Screenshot: worker list]

Now switch to the GPUs tab, where you can see one Apple M2 Ultra GPU and two NVIDIA RTX 4080 GPUs:

[Screenshot: GPUs tab]

For other installation scenarios, refer to the official GPUStack installation documentation: https://docs.gpustack.ai/quickstart/

 

Running Qwen 2.5

Navigate to Models from the menu on the left. We will deploy the following models from Hugging Face (all Qwen 2.5 models selected here are quantized with the Q4_K_M method; the * in some file names matches multi-part GGUF files that are split because of their size):

  • Qwen/Qwen2.5-0.5B-Instruct-GGUF qwen2.5-0.5b-instruct-q4_k_m.gguf
  • Qwen/Qwen2.5-1.5B-Instruct-GGUF qwen2.5-1.5b-instruct-q4_k_m.gguf
  • Qwen/Qwen2.5-3B-Instruct-GGUF qwen2.5-3b-instruct-q4_k_m.gguf
  • Qwen/Qwen2.5-7B-Instruct-GGUF qwen2.5-7b-instruct-q4_k_m*.gguf
  • Qwen/Qwen2.5-14B-Instruct-GGUF qwen2.5-14b-instruct-q4_k_m*.gguf
  • Qwen/Qwen2.5-32B-Instruct-GGUF qwen2.5-32b-instruct-q4_k_m*.gguf
  • Qwen/Qwen2.5-72B-Instruct-GGUF qwen2.5-72b-instruct-q4_k_m*.gguf
  • Qwen/Qwen2.5-Coder-1.5B-Instruct-GGUF qwen2.5-coder-1.5b-instruct-q4_k_m.gguf
  • Qwen/Qwen2.5-Coder-7B-Instruct-GGUF qwen2.5-coder-7b-instruct-q4_k_m*.gguf

[Screenshot: deployed models]

Testing Qwen 2.5 Models

Testing the chat models:

[Screenshot: chat model test]
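Beyond the built-in playground UI, deployed models can also be queried programmatically. A minimal sketch, assuming an API key created on GPUStack's API Keys page and the /v1-openai endpoint path documented for GPUStack at the time of writing:

```bash
# Query a deployed model through GPUStack's OpenAI-compatible API.
# Replace myserver and YOUR_GPUSTACK_API_KEY with your own values.
curl http://myserver/v1-openai/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_GPUSTACK_API_KEY" \
  -d '{
    "model": "qwen2.5-7b-instruct",
    "messages": [{"role": "user", "content": "Why is the sky blue?"}]
  }'
```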

 

Testing the Coder models:

[Screenshot: Coder model test]

 

Testing the Math models:

Checking VRAM and RAM allocation for the models:

  • Qwen 2.5 series models

[Screenshot: VRAM/RAM allocation for the Qwen 2.5 models]

  • Qwen2.5-Coder and Qwen2.5-Math

[Screenshot: VRAM/RAM allocation for the Qwen2.5-Coder and Qwen2.5-Math models]

 

 

The test results are summarized below (TPOT = time per output token, TTFT = time to first token):

| Name | Tokens/s | TPOT | TTFT | Allocated VRAM | Allocated RAM | Remarks |
|------|----------|------|------|----------------|---------------|---------|
| qwen2.5-0.5b-instruct | RTX 4080: 454.7; M2 Ultra: 212.11 | RTX 4080: 2.17 ms; M2 Ultra: 4.71 ms | RTX 4080: 16.91 ms; M2 Ultra: 95.99 ms | 1.0 GiB | 377.1 MiB | |
| qwen2.5-1.5b-instruct | RTX 4080: 301.48; M2 Ultra: 138.69 | RTX 4080: 3.32 ms; M2 Ultra: 7.21 ms | RTX 4080: 17.82 ms; M2 Ultra: 116.85 ms | 1.7 GiB | 442.8 MiB | |
| qwen2.5-3b-instruct | RTX 4080: 201.93; M2 Ultra: 106.67 | RTX 4080: 4.95 ms; M2 Ultra: 9.38 ms | RTX 4080: 21.2 ms; M2 Ultra: 168.9 ms | 2.6 GiB | 515.8 MiB | |
| qwen2.5-7b-instruct | RTX 4080: 124.42; M2 Ultra: 76.69 | RTX 4080: 8.04 ms; M2 Ultra: 13.04 ms | RTX 4080: 24.31 ms; M2 Ultra: 264.97 ms | 5.2 GiB | 741.6 MiB | |
| qwen2.5-14b-instruct | RTX 4080: 66.13; M2 Ultra: 42.31 | RTX 4080: 15.12 ms; M2 Ultra: 23.64 ms | RTX 4080: 47.51 ms; M2 Ultra: 468.85 ms | 9.5 GiB | 766.6 MiB | |
| qwen2.5-32b-instruct | 22.65 | 44.14 ms | 1436.63 ms | 20.1 GiB | 820.8 MiB | Unable to run on a single RTX 4080 |
| qwen2.5-72b-instruct | 11.33 | 88.24 ms | 2163.06 ms | 42.8 GiB | 1.2 GiB | Unable to run on a single RTX 4080 |
| qwen2.5-coder-1.5b-instruct | RTX 4080: 297.3; M2 Ultra: 138.09 | RTX 4080: 3.36 ms; M2 Ultra: 7.24 ms | RTX 4080: 29.34 ms; M2 Ultra: 130.49 ms | 1.1 GiB | 292.8 MiB | |
| qwen2.5-coder-7b-instruct | RTX 4080: 124.42; M2 Ultra: 75 | RTX 4080: 8.04 ms; M2 Ultra: 13.33 ms | RTX 4080: 39.24 ms; M2 Ultra: 294.41 ms | 5.2 GiB | 741.6 MiB | |
| qwen2.5-math-1.5b-instruct | 131.36 | 7.61 ms | 119.67 ms | 1.6 GiB | 434.8 MiB | M2 Ultra GPU |
| qwen2.5-math-7b-instruct | 72.02 | 13.89 ms | 1092.11 ms | 4.3 GiB | 583.6 MiB | M2 Ultra GPU |
| qwen2.5-math-72b-instruct | 10.52 | 95.06 ms | 2926.9 ms | 44.8 GiB | 1.2 GiB | M2 Ultra GPU |
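As a quick sanity check on the table, TPOT is roughly the inverse of the decode throughput (small differences can appear depending on how each metric is measured). A minimal sketch using the qwen2.5-math-7b-instruct row:

```bash
# TPOT (ms) ≈ 1000 / throughput (tokens/s)
awk 'BEGIN { printf "%.2f ms\n", 1000 / 72.02 }'   # prints 13.89 ms, matching the table
```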

Note:

  1. The performance data is based on tests conducted on the Apple M2 Ultra GPU and the NVIDIA RTX 4080 GPU. Performance on other GPUs may vary due to differences in computational power, VRAM bandwidth, and other factors.
  2. The maximum context size was limited to 8K tokens in these tests.
  3. The Math models were all run on the M2 Ultra GPU. Answer accuracy may be affected by factors such as computational power and quantization; for example, anomalous model behavior may occur under insufficient computational power or excessively high GPU utilization. The results are for reference only.

 

Join Our Community

Please find more information about GPUStack at: https://gpustack.ai.

If you encounter any issues or have suggestions for GPUStack, feel free to join our Community for support from the GPUStack team and to connect with fellow users globally.

We are actively enhancing the GPUStack project and plan to introduce new features in the near future, including support for multimodal models, additional accelerator frameworks such as AMD ROCm and Intel oneAPI, and more inference engines. We encourage you to follow and star the project on GitHub at gpustack/gpustack to receive instant notifications about future releases. We welcome your contributions to the project.

 

 

 
