
Run Microsoft OmniParser V2 on Ubuntu: Step-by-Step Installation Guide

Anas Mohammad

Feb 19, 2025

Microsoft's OmniParser V2 is a powerful tool designed to interpret user interface (UI) screenshots and convert them into a structured format. This enhances the ability of Large Language Models (LLMs) to interact with graphical user interfaces (GUIs), facilitating the creation of autonomous GUI agents that can effectively interact with on-screen components.

Key Features

Screen Parsing: Converts UI screenshots into structured, easy-to-understand elements.
Enhanced GUI Interaction: Enables autonomous GUI agents to interact more effectively with on-screen components.
Integration with LLMs: Compatible with OpenAI (GPT-4o/o1/o3-mini), DeepSeek (R1), and Qwen (2.5VL).
Support for OmniTool: Works with OmniTool, which pairs OmniParser with an LLM to control a Windows 11 virtual machine and carry out fully autonomous agentic actions.

Requirements for Installation

Before installing OmniParser V2, ensure your system meets the following requirements:

Operating System: Ubuntu
Python: Version 3.12
Conda: Package and environment management system
Git: For cloning the OmniParser repository
Hugging Face CLI: For downloading model checkpoints
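
A quick way to confirm the command-line prerequisites is to check that each one is on your PATH. The sketch below is a hypothetical helper, not part of OmniParser: the function name missing_tools is made up for illustration, and the command names (git, conda, huggingface-cli) are the ones used later in this guide.

```shell
#!/usr/bin/env bash
# Print the name of every tool from the argument list that is not on PATH.
# missing_tools is a hypothetical helper, not an OmniParser command.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Command names taken from the steps in this guide.
missing="$(missing_tools git conda huggingface-cli)"
if [ -n "$missing" ]; then
  echo "Missing from PATH:"
  echo "$missing"
else
  echo "All required tools found."
fi
```

If huggingface-cli is reported missing, it may only become available once a Python environment that provides it is active, so it can be worth rerunning the check after the environment step.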

Step-by-Step Installation Guide

Follow these steps to install and set up Microsoft OmniParser V2 on Ubuntu:

1. Clone the OmniParser Repository

git clone https://github.com/microsoft/OmniParser
cd OmniParser

2. Create a Conda Environment

conda create -n "omni" python==3.12
conda activate omni
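
Before installing dependencies, it can help to confirm that the active interpreter really is Python 3.12. The following is a minimal sketch: is_python_312 is a hypothetical helper name, and it assumes that python resolves to the interpreter of the activated omni environment.

```shell
#!/usr/bin/env bash
# Succeeds if the given version string (as printed by `python --version`)
# belongs to the 3.12 series. is_python_312 is a hypothetical helper.
is_python_312() {
  case "$1" in
    "Python 3.12"|"Python 3.12."*) return 0 ;;
    *) return 1 ;;
  esac
}

# Assumes `python` points at the interpreter of the active environment.
version="$(python --version 2>&1)"
if is_python_312 "$version"; then
  echo "OK: $version"
else
  echo "Expected Python 3.12 but found: $version"
fi
```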

3. Install Dependencies

pip install -r requirements.txt

4. Download Model Checkpoints

for f in icon_detect/{train_args.yaml,model.pt,model.yaml} icon_caption/{config.json,generation_config.json,model.safetensors}; do
  huggingface-cli download microsoft/OmniParser-v2.0 "$f" --local-dir weights
done
mv weights/icon_caption weights/icon_caption_florence
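
After the download and rename, the weights directory should contain the six files named in the loop above, with the caption files under icon_caption_florence. The sketch below checks for them; missing_weights is a hypothetical helper name, and the file list is copied from the download command rather than from any official manifest.

```shell
#!/usr/bin/env bash
# Print every expected checkpoint file that is absent under the given directory.
# missing_weights is a hypothetical helper; the file list mirrors the download
# loop in this guide, after icon_caption is renamed to icon_caption_florence.
missing_weights() {
  local dir="$1" f
  for f in \
    icon_detect/train_args.yaml \
    icon_detect/model.pt \
    icon_detect/model.yaml \
    icon_caption_florence/config.json \
    icon_caption_florence/generation_config.json \
    icon_caption_florence/model.safetensors
  do
    [ -f "$dir/$f" ] || echo "$f"
  done
}

# Run from the OmniParser directory, where the weights folder was created.
missing="$(missing_weights weights)"
if [ -n "$missing" ]; then
  echo "Missing checkpoint files:"
  echo "$missing"
else
  echo "All checkpoint files present."
fi
```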

5. Run the Gradio Demo (Optional)

python gradio_demo.py

OmniTool Integration

OmniTool pairs OmniParser with an LLM (such as GPT-4o) to control a Windows 11 virtual machine, enabling fully autonomous agentic actions.

Steps to Set Up OmniTool:

1. Set up a Windows 11 Virtual Machine
- Install VirtualBox or VMware.
- Download a Windows 11 ISO image from Microsoft.
- Create a new virtual machine and install Windows 11.

2. Install OmniParser and Dependencies
- Follow the installation steps mentioned above within the Windows 11 virtual machine.

3. Configure OmniTool
- Refer to the OmniTool documentation for configuration steps.
- Set up the connection between OmniParser and the LLM.

4. Test OmniTool
- Run test scripts to evaluate OmniTool’s performance.
- Monitor agentic actions and adjust configurations as needed.

Usage and Examples

After installing OmniParser V2, you can use it to parse UI screenshots and extract structured information.

Steps:

1. Open the demo.ipynb Notebook
- Navigate to the OmniParser directory.
- Open the demo.ipynb file using Jupyter Notebook.

2. Run the Examples
- Follow instructions in the notebook to load a UI screenshot and parse it.
- Examine the output to understand how OmniParser structures the UI elements.
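
Opening the notebook from a terminal might look like the sketch below. It assumes Jupyter is installed in the omni environment, which this guide does not confirm; run_demo_notebook is a hypothetical wrapper that only adds a check for the file before launching.

```shell
#!/usr/bin/env bash
# Open a notebook in Jupyter, failing early if the file is not found.
# run_demo_notebook is a hypothetical wrapper; it assumes the `jupyter`
# command is available in the active environment.
run_demo_notebook() {
  local nb="${1:-demo.ipynb}"
  if [ ! -f "$nb" ]; then
    echo "Notebook not found: $nb" >&2
    return 1
  fi
  jupyter notebook "$nb"
}
```

Call run_demo_notebook from the OmniParser directory with the omni environment active. With no argument it looks for demo.ipynb in the current directory.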

Troubleshooting

Dependency Issues: Confirm that the omni environment is active, then rerun pip install -r requirements.txt and check that it finishes without errors.
Model Checkpoint Errors: Verify that the checkpoint files were downloaded into the weights directory and that icon_caption was renamed to icon_caption_florence.
Gradio Demo Problems: Confirm that the dependencies installed cleanly and that no other process is already listening on the port the demo uses.
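
To check for a port conflict, you can look for an existing listener before starting the demo. The sketch below filters the output of ss -ltn (ss is part of iproute2, which ships with Ubuntu). The port_listed name is hypothetical, and 7860 is only Gradio's default port: the demo script may use a different one, so substitute the port shown in its startup message.

```shell
#!/usr/bin/env bash
# Read `ss -ltn` style output on stdin and succeed if some listening socket
# has the given local port. port_listed is a hypothetical helper.
port_listed() {
  awk -v port="$1" '
    NR > 1 && $4 ~ (":" port "$") { found = 1 }  # $4 is "Local Address:Port"
    END { exit !found }
  '
}

# 7860 is Gradio's default port; override with PORT=... if the demo uses another.
PORT="${PORT:-7860}"
if ss -ltn 2>/dev/null | port_listed "$PORT"; then
  echo "Port $PORT is already in use."
else
  echo "No listener found on port $PORT."
fi
```

If the port is taken, ss -ltnp (run with sudo to see processes owned by other users) shows which process holds it.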

Advanced Configuration

For advanced users, OmniParser V2 offers several configuration options:

Custom Model Weights: Train and integrate your own model weights.
Adjusting Parameters: Modify settings in train_args.yaml.
Integrating with Custom LLMs: Connect OmniParser with various LLMs per your requirements.

Use Cases for OmniParser V2

OmniParser V2 has various applications, including:

Automated Testing: Automating GUI application testing.
Robotic Process Automation (RPA): Automating repetitive GUI interactions.
Accessibility Enhancements: Providing structured information for users with disabilities.
Virtual Assistants: Enabling AI-driven virtual assistants to interact with UI elements.

Conclusion

Microsoft OmniParser V2 parses UI screenshots into structured information, which supports the development of autonomous GUI agents. By following this guide, you can install and run OmniParser V2 on Ubuntu. Its integration with OmniTool and compatibility with multiple LLMs make it useful for GUI automation and AI-driven applications.