Install and Run Hunyuan 7B on Linux/Ubuntu: An Installation Guide

Installing and running a 7-billion parameter (7B) Large Language Model—such as Mistral-7B, Llama-2-7B, or similar—on Linux/Ubuntu involves a sequence of well-defined steps covering system requirements, environment setup, Python dependencies, model download, and inference execution.
This comprehensive guide walks you through the entire process for a typical “7B” open-source model using HuggingFace’s Transformers library, including optional variations and troubleshooting for best results on a Linux or Ubuntu system.
1. Understanding the 7B Model Landscape
What is a 7B Model?
- “7B” stands for 7 billion parameters, indicating the model’s scale and performance class.
- Popular examples include Meta's Llama-2-7B, Mistral AI's Mistral-7B, DeepSeek's Janus-Pro-7B, and Google's Gemma-7B.
Model Architecture
- Most are transformer-based and support tasks like text generation, summarization, and chat.
Choosing the Right Model
- Match the model to your use case (e.g., code generation, instruction-following, etc.).
- This guide applies to any HuggingFace-hosted 7B model.
1.1 What is Hunyuan 7B?
Hunyuan 7B is part of Tencent's Hunyuan suite of large models, which spans text, image, and video generation. It is available in both pre-trained and instruction-tuned versions and serves as a backbone for AI applications in creative, analytical, and productivity domains.
Key Features
- State-of-the-art text generation and comprehension
- Multimodal capabilities (text, image, video)
- Instruction-following and prompt adaptation
2. Hardware and System Requirements
Recommended Minimum:
- RAM: 32GB (16GB may work with quantization)
- GPU: NVIDIA GPU with at least 16GB VRAM (RTX 3090/4090 ideal)
- Disk Space: 30GB+
- OS: Ubuntu 20.04+ (or Debian-based equivalent)
- Python: 3.8+
- CUDA: 11.x/12.x for GPU acceleration
You can run on CPU, but it will be significantly slower. Consider quantized models for CPU-based environments or use hosted inference.
3. Preparing the Linux Environment
3.1 Update System Packages
sudo apt update && sudo apt upgrade -y
sudo apt install python3 python3-pip python3-venv git wget -y
3.2 Create and Activate a Virtual Environment
python3 -m venv hunyuan_env
source hunyuan_env/bin/activate
4. Installing Python Dependencies
4.1 Install NVIDIA Drivers, CUDA, and cuDNN
Ensure the NVIDIA driver and CUDA toolkit are installed, then verify them with:
nvidia-smi
nvcc --version
4.2 Install Required Python Packages
pip install --upgrade pip
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install transformers huggingface_hub
Replace cu121 with the index matching your installed CUDA version (for example, cu118 for CUDA 11.8).
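Before moving on, it is worth confirming that PyTorch can actually see the GPU; the snippet below is a quick sanity check (run it inside the activated virtual environment):

import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))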
5. Model Selection and Download
Common HuggingFace model IDs:
meta-llama/Llama-2-7b-hf
mistralai/Mistral-7B-Instruct-v0.3
deepseek-ai/Janus-Pro-7B
google/gemma-7b
5.1 Download Model via HuggingFace Hub
from huggingface_hub import snapshot_download
from pathlib import Path
model_path = Path.home() / 'hunyuan_models' / 'Mistral-7B-Instruct-v0.3'
model_path.mkdir(parents=True, exist_ok=True)
snapshot_download(repo_id="mistralai/Mistral-7B-Instruct-v0.3", local_dir=model_path)
Use your desired model ID in repo_id.
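Gated models such as meta-llama/Llama-2-7b-hf and google/gemma-7b require you to accept their license on the HuggingFace model page and authenticate before snapshot_download will succeed. One way to authenticate from Python, assuming you have created a read-scoped access token in your HuggingFace account settings:

from huggingface_hub import login

# Paste a read-scoped access token created at https://huggingface.co/settings/tokens
login(token="hf_...")  # alternatively, run `huggingface-cli login` once in the shell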
6. Running Inference with Transformers
6.1 Basic Text Generation
from transformers import AutoTokenizer, AutoModelForCausalLM
model_name_or_path = "path_to_downloaded_model"  # e.g. the local_dir from Section 5.1, or a HuggingFace model ID
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, torch_dtype="auto", device_map="auto")
prompt = "Explain quantum computing in simple terms."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
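Instruction-tuned checkpoints such as Mistral-7B-Instruct usually ship a chat template, and results are generally better when the prompt is wrapped with it instead of passed as raw text. A minimal sketch, reusing the tokenizer and model loaded above and assuming the checkpoint defines a chat template:

messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
# apply_chat_template wraps the message in the model's expected instruction format
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(input_ids, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))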
7. Instruction or Chat Interface (Optional)
7.1 Mistral 7B Python API Example
This route uses Mistral's own mistral-inference package (pip install mistral-inference) rather than Transformers, and reads the raw tokenizer and weight files from the downloaded model folder.
from mistral_inference.transformer import Transformer
from mistral_inference.generate import generate
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest
tokenizer = MistralTokenizer.from_file(f"{model_path}/tokenizer.model.v3")
model = Transformer.from_folder(model_path)
completion_request = ChatCompletionRequest(messages=[UserMessage(content="Tell me a joke!")])
tokens = tokenizer.encode_chat_completion(completion_request).tokens
out_tokens, _ = generate([tokens], model, max_tokens=64, temperature=0.7, eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id)
print(tokenizer.instruct_tokenizer.tokenizer.decode(out_tokens[0]))
8. Quantization and Multi-GPU Setup
8.1 Quantized Models (Low RAM/CPU)
pip install bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    load_in_4bit=True,
    device_map="auto",
)
8.2 Multi-GPU Support
Set device_map='auto' or map layers to GPUs manually. Use Accelerate or DeepSpeed for advanced parallelization.
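With device_map="auto", Transformers (backed by Accelerate) spreads the layers across all visible GPUs automatically. You can also cap per-device memory so the split leaves headroom for activations; a sketch assuming two 16GB GPUs:

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    torch_dtype="auto",
    device_map="auto",
    max_memory={0: "14GiB", 1: "14GiB", "cpu": "32GiB"},  # reserve room for activations
)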
9. Alternative Interfaces and Frameworks
9.1 Server or UI Wrappers
Many 7B models support:
- Web UIs (Gradio, FastChat; a minimal Gradio example follows this list)
- REST APIs
- CLI-based chat
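A minimal browser wrapper sketch, assuming Gradio is installed (pip install gradio) and reusing the tokenizer and model loaded in Section 6:

import gradio as gr

def chat(prompt):
    # Tokenize the user prompt and return a short completion
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=256)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

gr.Interface(fn=chat, inputs="text", outputs="text", title="7B Model Demo").launch()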
9.2 Using LangChain
# On newer LangChain versions, import from langchain_community.llms instead
from langchain.llms import HuggingFacePipeline

llm = HuggingFacePipeline.from_model_id(
    model_id="mistralai/Mistral-7B-Instruct-v0.3",
    task="text-generation",
    model_kwargs={"temperature": 0.5, "max_length": 200},
)
response = llm("Summarize Linux memory management.")
print(response)
10. Tips for Efficient 7B Model Execution
- Use the -hf (HuggingFace-format) checkpoints, e.g. Llama-2-7b-hf, for seamless Transformers integration.
- Test the PyTorch + CUDA setup before model loading.
- Set local_files_only=True for offline environments.
- Use htop and nvidia-smi for performance monitoring.
- Reduce max_new_tokens to prevent memory overflow.
11. Troubleshooting
Issue: Model fails to load
Fix: Ensure correct paths and compatible CUDA drivers.
Issue: Out of Memory
Fix: Try quantization, smaller sequences, or use CPU fallback.
Issue: License required for model
Fix: Accept the license terms on the model's HuggingFace page for gated models like Llama-2, then authenticate with your access token.
Issue: Want CPU-only?
Fix: Install the CPU-only PyTorch build and keep the model on the CPU (omit device_map; tensors default to device='cpu').
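For a CPU-only setup, install the CPU wheels (pip install torch --index-url https://download.pytorch.org/whl/cpu) and load the model without any device mapping; expect generation to be much slower than on a GPU. A minimal sketch reusing the placeholder path from Section 6:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "path_to_downloaded_model"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float32)  # stays on the CPU

inputs = tokenizer("Explain quantum computing in simple terms.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))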
12. Automation with Docker (Optional)
Example Dockerfile:
FROM nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu20.04
RUN apt-get update && apt-get install -y python3 python3-pip git
RUN pip3 install torch torchvision torchaudio transformers huggingface_hub
COPY ./run_model.py /app/run_model.py
WORKDIR /app
ENTRYPOINT ["python3", "run_model.py"]
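The Dockerfile copies a run_model.py entrypoint that is not shown above. A minimal hypothetical version, assuming the downloaded model directory is mounted into the container at /models:

# run_model.py (hypothetical entrypoint; adjust MODEL_DIR to wherever the weights are mounted)
import sys
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_DIR = "/models/Mistral-7B-Instruct-v0.3"

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR, local_files_only=True)
model = AutoModelForCausalLM.from_pretrained(MODEL_DIR, torch_dtype="auto", device_map="auto", local_files_only=True)

prompt = sys.argv[1] if len(sys.argv) > 1 else "Hello!"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Run the container with the NVIDIA Container Toolkit (docker run --gpus all ...) so the GPU is visible inside it.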
13. Security Best Practices
- Never expose sensitive inputs/outputs to public endpoints.
- Keep your packages up to date.
- Use token limits and monitor abuse on API-based access.
14. Extensions and Custom Use
- Use LoRA or adapters for fine-tuning with less memory (see the PEFT sketch after this list).
- Deploy with FastAPI or vLLM for production APIs.
- Explore vector databases and semantic search via LangChain or Haystack.
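A minimal LoRA setup sketch using the PEFT library (pip install peft); the target module names below are typical attention projection names and vary between architectures, so adjust them to the model you are adapting:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3", device_map="auto")
lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention projections; names vary by model
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()        # only the adapter weights are trainable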
Conclusion
This guide serves as a modern blueprint to install, configure, and run any HuggingFace-hosted 7B LLM, such as Hunyuan 7B or Mistral 7B, on Linux/Ubuntu systems using open-source tools and current best practices.