GGUF Parser: A Tool for Estimating LLM Resource Requirements

GGUF Parser

Introducing GGUF Parser

GGUF is a highly efficient binary file format for storing models to be run with GGML and GGML-based executors, designed for fast loading and saving. Models are typically developed in PyTorch or another framework and then converted to GGUF for use with GGML.

GGUF Parser provides Go functions for parsing GGUF files, with two main purposes:

  • Read the metadata of a remote GGUF file without downloading the whole model.
  • Estimate the model resource requirements.

Estimating the RAM/VRAM required to run a model is a crucial step before deploying it. The estimate helps you choose an appropriate model size and a suitable quantization method.

For more details about GGUF Parser, visit:

GitHub repo: https://github.com/gpustack/gguf-parser-go

 

Getting Started with GGUF Parser

Download the gguf-parser binary from the Releases page at https://github.com/gpustack/gguf-parser-go, move it to /usr/local/bin, and grant it execution permissions (the following commands are for macOS):
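A minimal sketch of those commands, assuming the release asset for Apple silicon is named gguf-parser-darwin-arm64 (check the Releases page for the exact asset name and version):

    # Download the macOS (Apple silicon) build; the asset name below is an
    # assumption -- check the Releases page for the exact name.
    curl -L -o gguf-parser \
      https://github.com/gpustack/gguf-parser-go/releases/latest/download/gguf-parser-darwin-arm64

    # Grant execution permissions and move it into the PATH.
    chmod +x gguf-parser
    sudo mv gguf-parser /usr/local/bin/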

Run the following command to view GGUF Parser's runtime parameters (macOS requires allowing gguf-parser to run in Privacy & Security settings):
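A sketch of that command, assuming the binary supports the conventional help flag:

    gguf-parser --help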


Common parameters:

  • -path: Specifies the local model file path to load.
  • -hf-repo: Specifies the Hugging Face model repository to load.
  • -hf-file: Used with -hf-repo, specifies the GGUF model file name within the corresponding Hugging Face repository.
  • -gpu-layers: Specifies how many layers of the model are offloaded to the GPU. The more layers you offload, the faster the inference. The model's total number of layers is shown in the LAYERS field of the ARCHITECTURE section of the output.
  • -ctx-size: Specifies the context size to use, constrained by the model; the upper limit is shown in the Max Context Len field of the ARCHITECTURE section of the output.
  • -url: Specifies the URL path of the remote model file to load.

Run GGUF Parser and pay attention to the results in the ESTIMATE section.
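For example, here is a hedged sketch of two common invocations using the parameters described above (the model path, repository, and file names are placeholders):

    # Estimate a local GGUF model (the path is a placeholder).
    gguf-parser -path ./model.gguf

    # Estimate a model hosted on Hugging Face without downloading it
    # (the repository and file names are placeholders).
    gguf-parser -hf-repo <owner>/<repo> -hf-file <model-file>.gguf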


Based on the estimated results, under Apple's Unified Memory Architecture (UMA), the model will use:

  • 84.16 MiB of RAM and 1.12 GiB of VRAM, totaling 1.21 GiB of memory.

Under a non-UMA architecture, the model will use:

  • 234.16 MiB of RAM and 6.41 GiB of VRAM.

By default, all layers are offloaded to the GPU for acceleration, which maximizes GPU usage but may put pressure on GPU memory. You can use the -gpu-layers parameter to limit how many layers are offloaded (a command sketch follows below). For example, if 20 layers of the model are offloaded to the GPU in a non-UMA architecture, the model will use:

  • 754.16 MiB of RAM and 3.97 GiB of VRAM.
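A hedged sketch of that invocation, using the -gpu-layers parameter described above (the model path is a placeholder):

    # Offload only 20 layers to the GPU instead of all layers.
    gguf-parser -path ./model.gguf -gpu-layers 20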


Changing the model's context size affects memory usage. For example, decreasing the -ctx-size parameter from the default 8192 to 2048 reduces memory usage under Apple's UMA by 878.48 MiB, from 1.21 GiB to 360.16 MiB (a command sketch follows below).
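A hedged sketch of that invocation (the model path is a placeholder):

    # Estimate with a 2048-token context instead of the default 8192.
    gguf-parser -path ./model.gguf -ctx-size 2048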


If you want to learn more about how it works, you can join our Community to talk to our team.

 

GPUStack and GGUF Parser

GPUStack, an open-source GPU cluster manager for running large language models (LLMs), uses GGUF Parser to estimate LLM resource requirements and automatically schedules models onto machines with sufficient resources.


 

GPUStack allows you to create a unified cluster from any brand of GPUs in Apple Macs, Windows PCs, and Linux servers. Administrators can deploy LLMs from popular repositories such as Hugging Face. Developers can then access LLMs just as easily as accessing public LLM services from vendors like OpenAI or Microsoft Azure.

If you are interested in GPUStack, visit the following links to see more information:

Introducing GPUStack: https://gpustack.ai/introducing-gpustack

User guide: https://docs.gpustack.ai

 

About Us

GPUStack and GGUF Parser are brought to you by Seal, Inc., a team dedicated to enabling AI access for all. Our mission is to enable enterprises to use AI in their business, and GPUStack is a significant step toward achieving that goal.

Quickly build your own LLMaaS platform with GPUStack! Start experiencing the ease of creating GPU clusters locally, running and using LLMs, and integrating them into your applications.
