Run Microsoft Phi-4 on Mac: Installation Guide

Microsoft's Phi-4 models represent a breakthrough in efficient language model design, offering advanced natural language capabilities while maintaining hardware accessibility.

This guide covers all technical aspects of running Phi-4 Mini and Phi-4 Noesis variants on macOS, including architectural considerations, installation procedures, optimization strategies, and practical applications.

Model Architecture and Variants

Phi-4 Mini Specifications

  • Parameters: 3.8 billion
  • Architecture: Dense decoder-only Transformer
  • Capabilities:
    • Complex reasoning
    • Mathematical computation
    • Code generation
    • Instruction following

Phi-4 Noesis Features

  • Parameters: 14 billion (the configuration benchmarked on an M3 Pro below)
  • Optimizations:
    • MPS (Metal Performance Shaders) acceleration
    • 16k token context window
    • GPTQ quantization support

System Requirements

Component | Minimum Specs | Recommended Specs
OS Version | macOS 12.3+ | macOS 14+
Processor | Intel Core i7 | M1/M2/M3 Apple Silicon
RAM | 16GB | 32GB
Storage | 40GB free | SSD with 100GB free
Python | 3.9+ | 3.10+

Environment Setup

Core Dependencies

# Install Homebrew package manager
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Install Python 3.10
brew install python@3.10

# Verify installation
python3.10 --version  # Should show 3.10.x

Virtual Environment Configuration

python3.10 -m venv phi4-env
source phi4-env/bin/activate

Installation Methods

Method 1: Using Private LLM App

  1. Update to v1.9.6+
  2. Download model through app interface
  3. Configure with GPTQ quantization

Method 2: Manual Installation

# Install PyTorch with MPS support (stable releases >= 2.0 also include MPS)
pip3 install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cpu

# Install transformers library
pip install transformers sentencepiece accelerate
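
Before going further, it is worth confirming that PyTorch can actually see the Metal backend. A minimal check, run inside the activated virtual environment:

import torch

# Both should print True on Apple Silicon when PyTorch was built with MPS support
print("MPS available:", torch.backends.mps.is_available())
print("MPS built:", torch.backends.mps.is_built())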

Method 3: Step-by-Step Installation Guide

  1. Prerequisites:
    • macOS Version: macOS 12.3+ (Monterey or newer)
    • Chip: M1/M2/M3 Apple Silicon or Intel Core i7+
    • RAM: 16GB (32GB recommended)
    • Storage: 40GB free space
    • Python: 3.9 or higher (3.10 recommended)
  2. Install Dependencies
  3. Load the Phi-4 Model
  4. Optimize Performance
  5. Create a User-Friendly Interface

Steps 2-5 are detailed below.

Install Dependencies

Install Python via Homebrew:

brew install python@3.10

Then install PyTorch and the Hugging Face Transformers library:

pip install torch transformers

Load the Phi-4 Model

Use the Hugging Face Transformers library to load the Phi-4 model:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "microsoft/phi-4"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

Optimize Performance

Limit batch size to prevent out-of-memory errors:

max_batch_size = 2  # cap the number of prompts processed per call

Loading the model in 16-bit precision (torch_dtype=torch.float16, as above) already halves memory use; 4-bit quantization such as GPTQ (used by the Private LLM app) reduces it further when a pre-quantized checkpoint is available.

Create a User-Friendly Interface

Use Gradio (pip install gradio) to create a simple web interface:

import gradio as gr
import torch

def generate_text(prompt):
    # Move inputs to the Apple GPU (MPS); use "cpu" on Intel Macs
    inputs = tokenizer(prompt, return_tensors="pt").to("mps")
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=100)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

interface = gr.Interface(
    fn=generate_text,
    inputs=gr.Textbox(label="Prompt"),
    outputs=gr.Textbox(label="Generated Text"),
    title="Phi-4 Text Generator",
    description="Generate text using the Phi-4 model."
)
interface.launch()

Model Loading and Inference

Phi-4 Mini Implementation

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-4-mini-instruct"  # Phi-4 Mini checkpoint on Hugging Face
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

def generate(prompt):
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_length=100)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
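
A quick sanity check of the helper defined above (the prompt is arbitrary and the output will vary):

print(generate("Explain the difference between a list and a tuple in Python."))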

Phi-4 Noesis Configuration

model = AutoModelForCausalLM.from_pretrained(
    "dimsavva/phi4-noesis",
    trust_remote_code=True,
    device_map="auto",  # auto-detects the Apple Silicon (MPS) device
)
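
The snippet above only loads the weights. A minimal sketch of pairing it with its tokenizer and generating text, assuming the dimsavva/phi4-noesis repository ships a compatible tokenizer:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dimsavva/phi4-noesis", trust_remote_code=True)

prompt = "Summarize the benefits of on-device inference."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))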

Performance Optimization

Apple Silicon Enhancements

  • Metal Performance Shaders: up to ~8x speedup over CPU-only inference
  • CoreML Conversion (the PyTorch model must first be exported to TorchScript; simplified sketch):

import coremltools as ct

# 'traced_model' is a TorchScript export of the model (e.g. produced with torch.jit.trace)
coreml_model = ct.convert(traced_model)
coreml_model.save("phi4-mini.mlpackage")

Quantization Techniques

  • GPTQ 4-bit quantization
  • 16-bit floating-point precision (torch.float16)
  • Model parallelism via torch.distributed

Advanced Deployment

REST API Implementation

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/generate", methods=["POST"])
def generate():
    data = request.json
    prompt = data.get("prompt", "")
    # Run inference with the model and tokenizer loaded earlier
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=100)
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return jsonify({"response": generated_text})

if __name__ == "__main__":
    app.run(port=5000)
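
With the server running (Flask defaults to port 5000), a quick client-side test using the requests package (pip install requests):

import requests

resp = requests.post(
    "http://127.0.0.1:5000/generate",
    json={"prompt": "Write a one-line summary of the Phi-4 model."},
)
print(resp.json()["response"])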

Batch Processing Setup

from torch.utils.data import Dataset, DataLoader

class Phi4Dataset(Dataset):
    def __init__(self, prompts):
        self.prompts = prompts

    def __len__(self):
        return len(self.prompts)

    def __getitem__(self, idx):
        return self.prompts[idx]
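
A minimal usage sketch for batched generation with this dataset; it assumes the model, tokenizer, and max_batch_size from earlier sections, and the prompts are illustrative:

prompts = ["Explain overfitting in one sentence.", "Write a haiku about the sea."]
loader = DataLoader(Phi4Dataset(prompts), batch_size=max_batch_size)

# Padding is needed to stack prompts of different lengths into one tensor
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

for batch in loader:
    inputs = tokenizer(list(batch), return_tensors="pt", padding=True).to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=64)
    for seq in outputs:
        print(tokenizer.decode(seq, skip_special_tokens=True))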

Troubleshooting Guide

Issue | Solution
Out-of-memory errors | Reduce batch size, enable gradient checkpointing
MPS backend errors | Update to PyTorch 2.0+, verify Metal support
Tokenizer mismatch | Ensure transformers version ≥4.28.0
Slow inference | Enable use_cache=True, optimize with ONNX Runtime
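
Two of the fixes above are one-liners in transformers; model and inputs refer to the objects created in the loading examples earlier, and gradient checkpointing only applies if you fine-tune:

# Trade compute for memory during training or fine-tuning passes
model.gradient_checkpointing_enable()

# Reuse past key/value states during decoding to speed up generation
outputs = model.generate(**inputs, use_cache=True, max_new_tokens=100)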

Practical Applications

Use Case Examples

  1. Code Generation: Implement CI/CD pipeline scripts
  2. Research Analysis: Process academic papers
  3. Educational Tools: Create interactive learning modules
  4. Content Creation: Generate technical documentation

Benchmark Results

  • M3 Pro 36GB MacBook: 14B model at 15-20 tokens/sec
  • M1 Max: 3.8B model at 30+ tokens/sec
  • Intel i7: 3-5 tokens/sec (CPU-only mode)
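
To reproduce throughput figures like these on your own hardware, a simple measurement sketch (the prompt and token count are arbitrary; model and tokenizer as loaded earlier):

import time

inputs = tokenizer("Explain quantization in one paragraph.", return_tensors="pt").to(model.device)
start = time.time()
outputs = model.generate(**inputs, max_new_tokens=128)
elapsed = time.time() - start

new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/sec")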

Live Examples

Example 1: Mathematical Reasoning:

    • Prompt: "Solve for x in the equation 3x + 5 = 20."
    • Output: "To solve for x in the equation 3x + 5 = 20, first subtract 5 from both sides to get 3x = 15. Then divide both sides by 3 to find x = 5."

Example 2: Descriptive Text Generation:

    • Prompt: "Describe the image of a cat sitting on a windowsill."
    • Output: "The image shows a cat sitting gracefully on a windowsill, with its tail curled around its paws. The cat is looking out the window, its eyes reflecting the sunlight. The windowsill is wooden, with a few potted plants nearby."

Code for Example 1:

prompt = "Solve for x in the equation 3x + 5 = 20."
print(generate_text(prompt))

Code for Example 2:

prompt = "Describe the image of a cat sitting on a windowsill."
print(generate_text(prompt))

Future-Proofing Your Setup

Emerging Technologies

  • MLX Framework: Apple's machine-learning framework for Apple Silicon (see the sketch after this list)
  • M3 Ultra Support: Anticipated 80GB+ RAM configurations
  • Quantization Innovations: 2-bit precision experiments
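
As a forward-looking sketch, the mlx-lm package can already run quantized models natively on Apple Silicon. The checkpoint name below is illustrative only; check the mlx-community organization on Hugging Face for actual Phi-4 conversions:

# pip install mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/phi-4-4bit")  # illustrative model id
print(generate(model, tokenizer, prompt="Explain MLX in one sentence.", max_tokens=100))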

Maintenance Checklist

  1. Monthly PyTorch updates
  2. Bi-weekly virtual environment refresh
  3. Quarterly model re-quantization

Security Considerations

  • Local Execution: No data leaves device
  • Model Signing: Verify checksums before loading (see the sketch after this list)
  • Sandboxing: Use macOS App Sandbox for production
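
A minimal way to verify a downloaded weight file before loading it, assuming the publisher provides a SHA-256 hash to compare against (the file path here is illustrative):

import hashlib

def sha256sum(path, chunk_size=1 << 20):
    # Stream the file in chunks so multi-gigabyte weight files don't need to fit in RAM
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

print(sha256sum("phi-4/model-00001-of-00002.safetensors"))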

Comparative Analysis

Feature | Phi-4 Mini | Phi-4 Noesis
Parameters | 3.8B | 14B
RAM Requirements | 8GB+ | 16GB+
Context Window | 2,048 tokens | 16,384 tokens
Quantization Support | Basic | GPTQ

Conclusion

Running Microsoft Phi-4 on a Mac can be achieved by following the outlined steps. By leveraging Phi-4, developers and researchers can explore new possibilities in AI-driven applications, from educational tools to content creation and research assistance.
