How to Run Llama 4 on Ubuntu: A Comprehensive Guide

Running Llama 4 locally on Ubuntu lets you work with advanced AI models while keeping your data on your own hardware and lowering operational costs.
This guide walks you through every step—from setting up your system and installing necessary software to fine-tuning your model and troubleshooting common issues.
Prerequisites
Before you begin the installation process, verify that your system meets the following requirements:
Hardware Requirements
- GPU: Ensure you have at least one high-end GPU with sufficient VRAM. Basic tasks require at least 16GB, while optimal performance is achieved with 96GB or more.
- RAM: Minimum of 32GB is required, though 64GB or more is preferred for handling large-scale models.
- Storage: Sufficient SSD space (tens of GBs) is necessary for storing model weights and related data.
Software Requirements
- Operating System: Ubuntu 22.04 LTS or later is recommended for its stability and compatibility.
- CUDA Toolkit: Keep your NVIDIA drivers and CUDA toolkit current; this guide installs CUDA 12.1.
- Python: Use Python 3.8+ and consider a virtual environment for managing dependencies effectively.
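If you opt for a virtual environment, a minimal setup looks like this (the environment name llama-env is just an example):
python3 -m venv llama-env        # create an isolated environment
source llama-env/bin/activate    # activate it for the current shell
pip install --upgrade pip        # keep pip itself current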
Setting Up Your Environment
System Updates
Update your system to ensure that all packages are current. Run the following commands:
sudo apt update && sudo apt upgrade -y
sudo reboot
Install NVIDIA Drivers
For systems with an NVIDIA GPU, install the proprietary driver (version 525 is shown here; run ubuntu-drivers devices to see the version recommended for your card):
sudo apt install nvidia-driver-525
sudo reboot
After rebooting, verify the driver installation by running:
lsmod | grep nvidia
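You can also confirm that the driver and GPU are working with nvidia-smi:
nvidia-smi    # prints the driver version, supported CUDA version, and detected GPUs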
Install CUDA Toolkit
Download and install the CUDA toolkit with the following commands:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install cuda-12-1
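By default the toolkit lands under /usr/local/cuda-12.1. Adding its bin directory to your PATH lets you verify the install (the ~/.bashrc location assumes the default bash shell):
echo 'export PATH=/usr/local/cuda-12.1/bin:$PATH' >> ~/.bashrc
source ~/.bashrc
nvcc --version   # should report CUDA 12.1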
Installing Llama 4
Clone the Repository
Download the official llama.cpp repository from GitHub:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
Build the Application
Install the necessary build tools and compile the application:
sudo apt install build-essential cmake
cmake -B build
cmake --build build --config Release
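Since this guide installs CUDA, you will likely want a GPU-enabled build. In recent llama.cpp versions the relevant CMake flag is GGML_CUDA (older releases used LLAMA_CUBLAS or LLAMA_CUDA, so check the README for your checkout):
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j "$(nproc)"   # parallel build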
Download Model Weights
Obtain the Llama 4 model weights from a trusted source such as Hugging Face. Note that current llama.cpp builds load only GGUF-format files; the legacy GGML file below illustrates the download pattern but will not load on recent builds:
wget https://huggingface.co/Sosaka/Alpaca-native-4bit-ggml/resolve/main/ggml-alpaca-7b-q4.bin -P models/
Ensure that the downloaded model is placed in the models directory.
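To get a file that current builds can actually load, download a GGUF model instead. A sketch using the huggingface_hub CLI follows; <org>/<repo> and <file>.gguf are placeholders for a real GGUF release, not an actual repository:
pip install -U huggingface_hub
huggingface-cli download <org>/<repo> <file>.gguf --local-dir models/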
Running Llama 4
Launching the Application
Launch the model from the repository root, so that the relative models/ path resolves correctly:
./build/bin/llama-cli -m models/ggml-alpaca-7b-q4.bin
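A few commonly used llama-cli options are worth knowing; the values below are illustrative:
# -ngl offloads layers to the GPU, -c sets the context size in tokens, -p supplies a prompt
./build/bin/llama-cli -m models/<file>.gguf -ngl 99 -c 4096 -p "Hello"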
Optimizing RAM Usage
If your system has limited memory, consider these adjustments:
- Lock the Model into Memory
Use the --mlock flag (if supported by your build) to pin the model in RAM and prevent it from being swapped out, which can improve stability and performance in memory-constrained environments.
- Set the memlock ulimit to Unlimited
Check the current memory-lock limit:
ulimit -a | grep memlock
Then edit /etc/security/limits.conf and /etc/pam.d/common-session to raise the limit, as sketched below.
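As a sketch, the following limits.conf entries allow all users to lock unlimited memory, and the PAM line (usually already present on Ubuntu) ensures the limits are applied at login; treat the exact values as assumptions to adapt:
# /etc/security/limits.conf
* soft memlock unlimited
* hard memlock unlimited
# /etc/pam.d/common-session
session required pam_limits.so
Log out and back in (or reboot), then re-check with ulimit -a | grep memlock.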
Fine-Tuning Llama 4
Customize your Llama 4 setup by fine-tuning it for specific applications:
- Install Dependencies for Fine-Tuning
Install PyTorch along with the Transformers and Accelerate libraries:
pip install torch torchvision transformers accelerate cuda-python
- Initiate the Fine-Tuning Process
Load your dataset and run a training script to tailor the model to your needs (llama.cpp has shipped experimental fine-tuning examples in some releases, though most workflows rely on PyTorch-based tooling); a hypothetical launch command is sketched below.
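As a rough sketch only: finetune.py, the base-model ID, and dataset.jsonl below are placeholders rather than files shipped with llama.cpp. A single-GPU PyTorch/Transformers fine-tuning run is typically launched like this:
source llama-env/bin/activate   # virtual environment from the setup step (assumption)
# all names below are placeholders for your own script, model, and data
python finetune.py --model_name_or_path <base-model-id> --train_file dataset.jsonl --output_dir ./llama4-finetuned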
Troubleshooting Common Issues
- Error: "LLama Runner Process No Longer Running"
This error may indicate missing CUDA libraries or broken symbolic links. Verify that the LD_LIBRARY_PATH environment variable is set correctly, as shown after this list.
- Low RAM Utilization on Ubuntu
If the application uses only a fraction of the available memory, confirm that the limits set by ulimit are configured appropriately, and adjust the configuration files described earlier if necessary.
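For example, assuming the CUDA 12.1 install location used earlier in this guide, you can point LD_LIBRARY_PATH at the toolkit's libraries before launching:
export LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64:$LD_LIBRARY_PATH
./build/bin/llama-cli -m models/<file>.gguf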
Advanced Features
Running Llama 4 with a Web Interface
For a more user-friendly experience, set up a web interface to interact with Llama 4. The command below assumes a Node.js front end configured to talk to an Ollama-compatible API listening on its default port, 11434:
npm run dev -- PUBLIC_API_BASE_URL='http://localhost:11434/api'
Then access the interface in your browser at the address the dev server prints; the API itself is served at http://localhost:11434.
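Alternatively, to stay entirely within llama.cpp, the project ships llama-server, which exposes an OpenAI-compatible HTTP API together with a simple built-in web UI (8080 is its default port):
./build/bin/llama-server -m models/<file>.gguf --port 8080
# then browse to http://localhost:8080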
Using Multiple GPUs
For larger models such as the 70B-parameter Llama variants, configure multi-GPU support. PyTorch's distributed training capabilities apply when fine-tuning, while llama.cpp's own binaries can split a single model across several GPUs for improved performance and scalability; see the sketch below.
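A sketch using llama.cpp's splitting flags; the even 1,1 split across two GPUs is an example value to adjust for your hardware:
# --split-mode layer distributes layers across GPUs; --tensor-split sets each GPU's share
./build/bin/llama-cli -m models/<file>.gguf -ngl 99 --split-mode layer --tensor-split 1,1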
Conclusion
Running Llama 4 on Ubuntu empowers both AI enthusiasts and professionals to experiment with cutting-edge models while retaining full control over their data and infrastructure.
By following this guide, you can set up, optimize, and troubleshoot your Llama 4 installation efficiently, and explore advanced functionality such as fine-tuning and web-based interfaces.