Install and Run Hunyuan 7B on Linux/Ubuntu: An Installation Guide

Installing and running a 7-billion parameter (7B) Large Language Model—such as Mistral-7B, Llama-2-7B, or similar—on Linux/Ubuntu involves a sequence of well-defined steps covering system requirements, environment setup, Python dependencies, model download, and inference execution.
This comprehensive guide walks you through the entire process for a typical “7B” open-source model using HuggingFace’s Transformers library, including optional variations and troubleshooting for best results on a Linux or Ubuntu system.
1. Understanding the 7B Model Landscape
What is a 7B Model?
- “7B” stands for 7 billion parameters, indicating the model’s scale and performance class.
- Popular examples include Meta's Llama-2-7B, Mistral AI's Mistral-7B, DeepSeek's Janus-Pro-7B, and Google's Gemma-7B.
Model Architecture
- Most are transformer-based and support tasks like text generation, summarization, and chat.
Choosing the Right Model
- Match the model to your use case (e.g., code generation, instruction-following, etc.).
- This guide applies to any HuggingFace-hosted 7B model.
1.1 What is Hunyuan 7B?
Hunyuan 7B is part of Tencent's Hunyuan suite of large models, which spans text, image, and video generation. It is available in both pre-trained and instruction-tuned versions and serves as a backbone for AI applications in creative, analytical, and productivity domains.
Key Features
- State-of-the-art text generation and comprehension
- Multimodal capabilities (text, image, video)
- Instruction-following and prompt adaptation
2. Hardware and System Requirements
Recommended Minimum:
- RAM: 32GB (16GB may work with quantization)
- GPU: NVIDIA GPU with at least 16GB VRAM (RTX 3090/4090 ideal)
- Disk Space: 30GB+
- OS: Ubuntu 20.04+ (or Debian-based equivalent)
- Python: 3.8+
- CUDA: 11.x/12.x for GPU acceleration
You can run on CPU, but it will be significantly slower. Consider quantized models for CPU-based environments or use hosted inference.
3. Preparing the Linux Environment
3.1 Update System Packages
sudo apt update && sudo apt upgrade -y
sudo apt install python3 python3-pip python3-venv git wget -y
3.2 Create and Activate a Virtual Environment
python3 -m venv hunyuan_env
source hunyuan_env/bin/activate
4. Installing Python Dependencies
4.1 Install NVIDIA Drivers, CUDA, and cuDNN
Ensure the NVIDIA driver and CUDA toolkit are installed, then verify them with:
nvidia-smi
nvcc --version
4.2 Install Required Python Packages
pip install --upgrade pip
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install transformers huggingface_hub
Replace cu121 with the index matching your installed CUDA version (for example, cu118 for CUDA 11.8).
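Before moving on, it is worth confirming that PyTorch can actually see the GPU; the snippet below is a quick sanity check (run it inside the activated virtual environment):

import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))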
5. Model Selection and Download
Common HuggingFace model IDs:
meta-llama/Llama-2-7b-hf
mistralai/Mistral-7B-Instruct-v0.3
deepseek-ai/Janus-Pro-7B
google/gemma-7b
5.1 Download Model via HuggingFace Hub
from huggingface_hub import snapshot_download
from pathlib import Path
model_path = Path.home() / 'hunyuan_models' / 'Mistral-7B-Instruct-v0.3'
model_path.mkdir(parents=True, exist_ok=True)
snapshot_download(repo_id="mistralai/Mistral-7B-Instruct-v0.3", local_dir=model_path)
Use your desired model ID in repo_id.
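Gated models such as meta-llama/Llama-2-7b-hf and google/gemma-7b require you to accept their license on the HuggingFace model page and authenticate before snapshot_download will succeed. One way to authenticate from Python, assuming you have created a read-scoped access token in your HuggingFace account settings:

from huggingface_hub import login

# Paste a read-scoped access token created at https://huggingface.co/settings/tokens
login(token="hf_...")  # alternatively, run `huggingface-cli login` once in the shell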
6. Running Inference with Transformers
6.1 Basic Text Generation
from transformers import AutoTokenizer, AutoModelForCausalLM
model_name_or_path = "path_to_downloaded_model"  # e.g. the local_dir from Section 5.1, or a HuggingFace model ID
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, torch_dtype="auto", device_map="auto")
prompt = "Explain quantum computing in simple terms."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
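Instruction-tuned checkpoints such as Mistral-7B-Instruct usually ship a chat template, and results are generally better when the prompt is wrapped with it instead of passed as raw text. A minimal sketch, reusing the tokenizer and model loaded above and assuming the checkpoint defines a chat template:

messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
# apply_chat_template wraps the message in the model's expected instruction format
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(input_ids, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))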
7. Instruction or Chat Interface (Optional)
7.1 Mistral 7B Python API Example
This route uses Mistral's own mistral-inference package (pip install mistral-inference) rather than Transformers, and reads the raw tokenizer and weight files from the downloaded model folder.
from mistral_inference.transformer import Transformer
from mistral_inference.generate import generate
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest
tokenizer = MistralTokenizer.from_file(f"{model_path}/tokenizer.model.v3")
model = Transformer.from_folder(model_path)
completion_request = ChatCompletionRequest(messages=[UserMessage(content="Tell me a joke!")])
tokens = tokenizer.encode_chat_completion(completion_request).tokens
out_tokens, _ = generate([tokens], model, max_tokens=64, temperature=0.7, eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id)
print(tokenizer.instruct_tokenizer.tokenizer.decode(out_tokens[0]))
8. Quantization and Multi-GPU Setup
8.1 Quantized Models (Low RAM/CPU)
pip install bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    load_in_4bit=True,
    device_map="auto",
)
8.2 Multi-GPU Support
Set device_map='auto' or map layers to GPUs manually. Use Accelerate or DeepSpeed for advanced parallelization.
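With device_map="auto", Transformers (backed by Accelerate) spreads the layers across all visible GPUs automatically. You can also cap per-device memory so the split leaves headroom for activations; a sketch assuming two 16GB GPUs:

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    torch_dtype="auto",
    device_map="auto",
    max_memory={0: "14GiB", 1: "14GiB", "cpu": "32GiB"},  # reserve room for activations
)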
9. Alternative Interfaces and Frameworks
9.1 Server or UI Wrappers
Many 7B models support:
- Web UIs (Gradio, FastChat; a minimal Gradio example follows this list)
- REST APIs
- CLI-based chat
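A minimal browser wrapper sketch, assuming Gradio is installed (pip install gradio) and reusing the tokenizer and model loaded in Section 6:

import gradio as gr

def chat(prompt):
    # Tokenize the user prompt and return a short completion
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=256)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

gr.Interface(fn=chat, inputs="text", outputs="text", title="7B Model Demo").launch()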
9.2 Using LangChain
# On newer LangChain versions, import from langchain_community.llms instead
from langchain.llms import HuggingFacePipeline

llm = HuggingFacePipeline.from_model_id(
    model_id="mistralai/Mistral-7B-Instruct-v0.3",
    task="text-generation",
    model_kwargs={"temperature": 0.5, "max_length": 200},
)
response = llm("Summarize Linux memory management.")
print(response)
10. Tips for Efficient 7B Model Execution
- Use the -hf (HuggingFace-format) checkpoints, e.g. Llama-2-7b-hf, for seamless Transformers integration.
- Test the PyTorch + CUDA setup before model loading.
- Set local_files_only=True for offline environments.
- Use htop and nvidia-smi for performance monitoring.
- Reduce max_new_tokens to prevent memory overflow.
11. Troubleshooting
Issue: Model fails to load
Fix: Ensure correct paths and compatible CUDA drivers.
Issue: Out of Memory
Fix: Try quantization, smaller sequences, or use CPU fallback.
Issue: License required for model
Fix: Accept the license terms on the model's HuggingFace page for gated models like Llama-2, then authenticate with your access token.
Issue: Want CPU-only?
Fix: Install the CPU-only PyTorch build and keep the model on the CPU (omit device_map; tensors default to device='cpu').
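For a CPU-only setup, install the CPU wheels (pip install torch --index-url https://download.pytorch.org/whl/cpu) and load the model without any device mapping; expect generation to be much slower than on a GPU. A minimal sketch reusing the placeholder path from Section 6:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "path_to_downloaded_model"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float32)  # stays on the CPU

inputs = tokenizer("Explain quantum computing in simple terms.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))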
12. Automation with Docker (Optional)
Example Dockerfile:
FROM nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu20.04
RUN apt-get update && apt-get install -y python3 python3-pip git
RUN pip3 install torch torchvision torchaudio transformers huggingface_hub
COPY ./run_model.py /app/run_model.py
WORKDIR /app
ENTRYPOINT ["python3", "run_model.py"]
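The Dockerfile copies a run_model.py entrypoint that is not shown above. A minimal hypothetical version, assuming the downloaded model directory is mounted into the container at /models:

# run_model.py (hypothetical entrypoint; adjust MODEL_DIR to wherever the weights are mounted)
import sys
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_DIR = "/models/Mistral-7B-Instruct-v0.3"

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR, local_files_only=True)
model = AutoModelForCausalLM.from_pretrained(MODEL_DIR, torch_dtype="auto", device_map="auto", local_files_only=True)

prompt = sys.argv[1] if len(sys.argv) > 1 else "Hello!"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Run the container with the NVIDIA Container Toolkit (docker run --gpus all ...) so the GPU is visible inside it.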
13. Security Best Practices
- Never expose sensitive inputs/outputs to public endpoints.
- Keep your packages up to date.
- Use token limits and monitor abuse on API-based access.
14. Extensions and Custom Use
- Use LoRA or adapters for fine-tuning with less memory (see the PEFT sketch after this list).
- Deploy with FastAPI or vLLM for production APIs.
- Explore vector databases and semantic search via LangChain or Haystack.
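A minimal LoRA setup sketch using the PEFT library (pip install peft); the target module names below are typical attention projection names and vary between architectures, so adjust them to the model you are adapting:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3", device_map="auto")
lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention projections; names vary by model
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()        # only the adapter weights are trainable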
Conclusion
This guide serves as a modern blueprint to install, configure, and run any HuggingFace-hosted 7B LLM, such as Hunyuan 7B or Mistral 7B, on Linux/Ubuntu systems using open-source tools and current best practices.